[Expo-tech] wonderfully complex bug in encodings

Thu May 7 23:49:27 BST 2020

On 2020-05-07 21:15 +0100, Philip Sargent (Gmail) wrote:
> tl;dr
> So something about the combination of the _actual_
> character encoding, the declared character encoding, and the doctype
> makes the body sufficiently invalid that django/troggle fails to
> deliver it.
> 
> resetting 
> charset=iso-8859-1
> instead of 
> charset=utf-8 
> fixed it.

Hmm. I thought I tried that and it didn't seem to work. Ah, I missed
a dash out: I typed 'iso8859-1'.

> - - -- - -- - -- - -- - -- - -- - -- - -- - -- - -- - -
> OK, that was undoubtedly my fault because I rearranged all the geology
> articles in January.

So you changed the charset metadata without actually checking what charset was
in use? Not likely to end well.

> Crying out for an automated checker,

Doesn't really need to be automated. Just run it. Make sure stuff is
right. Everything new is very likely to be utf-8.
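
A rough sketch of what 'make sure stuff is right' could look like for one
file's bytes, in Python 3. The function names and the loose regex are
illustrative assumptions, not anything troggle or django provides, and a
file that is not valid utf-8 is only reported as 'unknown-8bit' rather
than guessed to be iso-8859-1:

```python
import re

# Loose, hypothetical pattern: matches both <meta charset="..."> and
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def declared_charset(data: bytes):
    """Return the first charset named in the first 2048 bytes (lower-cased), or None."""
    m = CHARSET_RE.search(data[:2048])
    return m.group(1).decode("ascii").lower() if m else None

def actual_charset(data: bytes) -> str:
    """Classify the bytes as 'ascii', 'utf-8' or 'unknown-8bit'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "unknown-8bit"

def charset_mismatch(data: bytes) -> bool:
    """True if the declared charset disagrees with what the bytes look like."""
    declared = declared_charset(data)
    actual = actual_charset(data)
    if declared is None or actual == "ascii":
        return False  # nothing declared, or pure ASCII (valid under either charset)
    declared_is_utf8 = declared in ("utf-8", "utf8")
    if actual == "utf-8":
        return not declared_is_utf8
    return declared_is_utf8  # 8-bit bytes that are not valid utf-8
```

Run over each .htm/.html file (read in binary mode), anything for which
charset_mismatch() returns True is worth a look by hand.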

> perhaps as part of the thing that turns
> CRLF
> into LF which we run intermittently.
> http://expo.survex.com/handbook/computing/regular.html
> 
> or an automated thing whenever a html file is pushed to the server?

I guess some post-hook checks (especially for mixed CRLF) would be a
good idea. The facility has been there for at least a decade... Any VCS
can do this.

> There are 30 html files with \xfc in them. So yes these needs fixing 
> by making sure that charset=iso-8859-1
> 
> Longer term, troggle should be fixed. Or a later version of django might fix
> it.

It may not be strictly broken. Exactly what it does about invalid
input is kind of up to it. It could be more helpful.

Python3 may well change things - the string handling does a much better
and radically different job. But it'll be a while before we get there.

> grep -rcP "\xf6" * | grep -v 0$ | grep htm

I don't think that is doing what you think.

So far as I can tell file 1623/110/l/entrance.html does not contain
the byte F6. It does contain two utf-8-encoded o-umlauts (encoded as
'C3 B6'). But I think the file is being de-utf-8ed before grep gets to check,
so it then finds 'F6' (which is 'correct' in the display charset).

So that's a terrible check for 'might this file really be iso-8859-1?'.
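
To illustrate the difference between matching a byte and matching a
character, here is a small Python 3 sketch. The word 'Höhle' and the
helper name has_byte are made up for the example, and it assumes the
explanation above (the pattern being matched after utf-8 decoding) is
what is actually happening:

```python
# o-umlaut is U+00F6: one byte (F6) in iso-8859-1, two bytes (C3 B6) in utf-8.
text = "H\u00f6hle"
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("iso-8859-1")

def has_byte(data: bytes, byte: int) -> bool:
    """True if the raw byte value occurs anywhere in data."""
    return byte in data

print(as_utf8.hex())                 # 48c3b6686c65
print(as_latin1.hex())               # 48f6686c65
print(has_byte(as_utf8, 0xF6))       # False: no raw F6 byte in the utf-8 file
print(has_byte(as_latin1, 0xF6))     # True: a genuine iso-8859-1 o-umlaut

# Matching after decoding finds the character U+00F6 in the utf-8 data,
# which is the false positive described above.
print("\xf6" in as_utf8.decode("utf-8"))  # True
```

Only the byte-level test says anything about which encoding the file
is actually in.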

I've not checked all the others.

'file' correctly identifies 'geolog.htm' as 'ASCII' and
1623/110/l/entrance.html as utf-8, so I suggest you use that.

dos2unix/unix2dos is great for translating CRLF <-> LF but I never
found a way to get it to just tell me which flavour it thought a file
was. For a post-commit script we don't care - just run that
command. It claims to not break binary files. 

Seems it has an -i option now to just show you what it thinks a file is (the -D option is something else: it sets the display encoding, and only on Windows).

So
find * -print0 | xargs -0 dos2unix -i | grep -v binary > /tmp/dosfiles
will get you a list of all the text files (and any misidentified binary files)
and show which have DOS line endings. I see there are no binary files
misidentified as text, which is good and suggests that it would be safe as a
post-commit script.
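
For a hook that only needs to report rather than convert, the same
classification can be sketched in a few lines of Python 3. This is an
illustration under simple assumptions, not what dos2unix does
internally: the function name is made up, bare CR (old Mac) endings are
ignored, and no attempt is made to detect binary files:

```python
def line_ending_style(data: bytes) -> str:
    """Return 'none', 'lf', 'crlf' or 'mixed' for the given file contents."""
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf  # LFs that are not part of a CRLF pair
    if crlf and lf_only:
        return "mixed"
    if crlf:
        return "crlf"
    if lf_only:
        return "lf"
    return "none"
```

A check script would read each file in binary mode, call
line_ending_style() on the contents, and complain about anything that
comes back as 'mixed' (or 'crlf', if we decide DOS-only files are
unwanted too).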

Turns out there are quite a few files with DOS line endings, a fair
number of which look like we can blame you for :-) They should probably
all be fixed (just run the above command without -i), but I'd check
that there aren't good reasons first. (emacs is a good tool as it hides
lineends unless they are mixed, in which case it shows them.)

It's possible that mixed lineends are acceptable in an SVG file but
probably still not a good idea?

These have mixed endings, which is the thing you _really_ want to avoid:
years/2017/ukcaving/index.html
years/2011/firstaidrequirements_austria.html
years/1977/report.htm
update.htm
handbook/druginfo.html
folk/to-do/readme.html
entrance_data/1623-CUCC2015DL02.html
entrance_data/1623-277.html
entrance_data/1623-CUCC2015DL01.html
entrance_data/1623-2001-02.html
cave_data/1623-2004-18.html
cave_data/1623-2012-dd-08.html
cave_data/1623-B4.html
cave_data/1623-2018-aa-01.html
cave_data/1623-CUCC2015DL02.html
cave_data/1623-CUCC2015DL01.html
cave_data/1623-2001-02.html
1623/277/277plan.svg

All these are DOS-only:
1623/234/surveys/234_2008-rl.xml
1623/204/qm.csv
cave_data/1623-2012-dd-05.html
documents/bierbook/names.txt
documents/bierbook/dates.txt
documents/bierbook/seshbook.tex
documents/bierbook/readme.txt
documents/therionprotractors/therionpage.tex
entrance_data/1623-264.html
folk/README
folk/folk.csv
handbook/troggle/menudesign.html
handbook/computing/todo-styles.css
handbook/computing/ftpusage.html
noinfo/scripts/loser-caves1624-raw-data/Uebersicht_2
scripts/detect-filename-clashes.py
years/2018/theasiancookshop-order.html
years/2018/theasiancookshop-order2.html
years/2018/stuffbought.html
years/2011/index.html
years/2009/index.html
years/2006/logbook.html
years/2006/stuffinaustria.txt
years/2008/topcampbivvylist.html
years/2008/suggestions.html
years/2007/grants.html
years/2007/Expo07frm.rtf
years/2007/expo07shirtfront.svg
years/2007/expo07shirtback.svg
years/2007/links.html
years/2007/stuffinaustria.txt
years/2002/goals.htm
years/2014/index.html
years/2014/basecamplist.html
years/2014/foodbought.html
years/2013/basecamplist.html
years/2010/index.html
years/2010/basecamplist.html
years/2010/topcamplist.html

Now that's two nights running you've distracted me from doing loser dataset stuff...

Wookey
-- 
Principal hats:  Linaro, Debian, Wookware, ARM
http://wookware.org/