[Expo-tech] wonderfully complex bug in encodings

Philip Sargent (Gmail) philip.sargent at gmail.com
Thu May 7 21:15:33 BST 2020


tl;dr
So something about the combination of the _actual_
character encoding, the declared character encoding, and the doctype
makes the body sufficiently invalid that django/troggle fails to
deliver it.

Resetting the declared encoding to
charset=iso-8859-1
instead of
charset=utf-8
fixed it.
- - -- - -- - -- - -- - -- - -- - -- - -- - -- - -- - -
OK, that was undoubtedly my fault because I rearranged all the geology
articles in January.

This is crying out for an automated checker, perhaps as part of the thing
that turns CRLF into LF, which we run intermittently:
http://expo.survex.com/handbook/computing/regular.html

Or an automated check whenever an html file is pushed to the server?

There are 30 html files with \xfc in them. So yes, these need fixing
by making sure that they declare charset=iso-8859-1
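
A minimal sketch of what such a checker might look like, in Python. The
function names are made up for illustration and are not part of troggle, and
treating a missing declaration as utf-8 is an assumption. It only catches the
case seen here (a file declared as utf-8 whose bytes are not valid utf-8),
because iso-8859-1 accepts every byte value and so can never fail to decode.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# Finds charset=... in a <meta> tag (either the short or the http-equiv form).
CHARSET_RE = re.compile(rb'''charset\s*=\s*["']?([\w-]+)''', re.IGNORECASE)


def declared_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in the first 1024 bytes, else the default."""
    match = CHARSET_RE.search(raw[:1024])
    return match.group(1).decode("ascii").lower() if match else default


def check_bytes(raw: bytes) -> Optional[str]:
    """Return a description of the problem, or None if the bytes decode."""
    charset = declared_charset(raw)
    try:
        raw.decode(charset)
    except LookupError:
        return f"unknown charset {charset!r}"
    except UnicodeDecodeError as err:
        bad = raw[err.start]
        return f"declared {charset} but byte 0x{bad:02x} at offset {err.start} is invalid"
    return None


def check_tree(root: str) -> List[Tuple[str, str]]:
    """Check every .htm/.html file below root; return (path, problem) pairs."""
    problems = []
    for path in sorted(Path(root).rglob("*.htm*")):
        if path.is_file():
            problem = check_bytes(path.read_bytes())
            if problem:
                problems.append((str(path), problem))
    return problems
```

Running check_tree("/home/expo/expoweb") and printing the pairs would give a
list of files to look at; it could be hooked into the CRLF job or a push hook.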

Longer term, troggle should be fixed. Or a later version of django might fix
it.

grep -rcP "\xf6" * | grep -v 0$ | grep htm

1623/110/l/entrance.html:2
1623/110/l/entrancearea.html:2
1623/264/ent.html:9
cave_data/1623-1996WK11.html:1
cave_data/1623-2002-AD-04.html:1
cave_data/1623-2004-01.html:1
cave_data/1623-2004-03.html:1
cave_data/1623-2011-01.html:2
cave_data/1623-2012-dd-05.html:1
cave_data/1623-2012-ns-10.html:1
cave_data/1623-2017-cucc-28.html:1
cave_data/1623-264.html:1
cave_data/1623-267.html:1
cave_data/1623-271.html:1
cave_data/1623-273.html:1
caves-tabular.htm:6
entrance_data/1623-1996WK11.html:1
entrance_data/1623-2009-03.html:1
entrance_data/1623-2012-dd-05.html:1
folk/l/natdalton.htm:3
handbook/computing/todo-data.html:8
others/arge/index.html:1
years/2008/mission.html:1
years/2011/descentarticle.html:11
years/2013/logbook.html:1
years/2015/logbook.html:27
years/2016/logbook.html:11
years/2017/logbook.html:7
years/2017/ukcaving/index.html:7
years/2018/logbook.html:1
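
For anyone without grep -P to hand, a rough Python equivalent of the count
above (a sketch only; the function names are invented here). It counts lines
containing the byte, as grep -c does. Filtering on the count itself, rather
than on output lines ending in 0, also keeps files whose count is 10, 20 and
so on.

```python
from pathlib import Path


def lines_with_byte(raw: bytes, byte: int = 0xF6) -> int:
    """Count the lines that contain the given byte value."""
    needle = bytes([byte])
    return sum(1 for line in raw.split(b"\n") if needle in line)


def scan(root: str, byte: int = 0xF6) -> dict:
    """Map each file with 'htm' in its name to its non-zero line count."""
    counts = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and "htm" in path.name:
            n = lines_with_byte(path.read_bytes(), byte)
            if n > 0:
                counts[str(path)] = n
    return counts
```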

-----Original Message-----
From: Expo-tech [mailto:expo-tech-bounces at lists.wookware.org] On Behalf Of
Wookey
Sent: 07 May 2020 01:32
To: expo-tech at lists.wookware.org
Subject: Re: [Expo-tech] peculiar file not visible problem

On 2020-05-06 22:41 +0100, Philip Sargent (Gmail) wrote:
> http://expo.survex.com/geolog.htm
> appears as an empty file.
> But it is definitely there on the server when I login via ssh:
> 
> ls -tlrgaA geol*
> -rw-rw-r-- 1 www-data   5993 Apr 18 04:23 geolog2.htm
> -rwxrwxr-x 1 www-data 403865 Apr 18 12:21 geology.jpg
> -rw-rw-r-- 1 www-data   7936 May  6 22:37 geolog.htm
> pwd
> /home/expo/expoweb
> 
> git status says everything is clean.
> 
> What's going on ?

Good question. I can't see what's wrong either. Renaming it doesn't
help. Neither does removing the header. I suspected the troggle url
table of stealing the name but I can't see anything that would and
geolog2.htm works. Nothing obvious in the apache log because troggle
made up a blank page and delivered that so apache is happy.

I presume there is some troggle debug that would help get to the bottom of
this.

It's not saying 'page not found' so it is finding the file. It's just
choosing not to display anything.

Moving </head> down two lines (two below </body>, making invalid html)
makes the page all appear (because </head> is deemed invalid by the
browser and ignored - or at least firefox prints it in red, which is
what I take that to mean. Normally the <html> at the top is red,
suggesting there is something wrong with the basic document structure).
When it all appears there is a 'Page could not be split into header
and body' message, suggesting that essentially this page is only
displaying the header, not the body, so making it all look like header
means you see it all.

OK. Took a while but it was because the page contained characters <F6>
(o-umlaut) and <FC> (u-umlaut) in the body (1st para). Delete those
and it appears. So something about the combination of the _actual_
character encoding, the declared character encoding, and the doctype
makes the body sufficiently invalid that django/troggle fails to
deliver it (unless it is declared a header in which case it just sends
it as is).  I'm not sure what the right fix is. You could just use
&-entities (&ouml; and &uuml;) instead, but understanding it is probably
a good idea. There may be other pages with the same problem.
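
To illustrate the mismatch (a sketch, not what troggle actually does
internally): the single bytes F6 and FC are o-umlaut and u-umlaut in
iso-8859-1, but neither is a valid byte in utf-8, so decoding the body as
utf-8 fails. The helper names below are invented for the example; the two
converters show the entity route and the re-encode-as-utf-8 route.

```python
def decodes_as(raw: bytes, charset: str) -> bool:
    """True if the bytes are valid in the given charset."""
    try:
        raw.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


def to_ascii_entities(raw: bytes, source: str = "iso-8859-1") -> bytes:
    """Replace each non-ASCII character with a numeric character reference."""
    return raw.decode(source).encode("ascii", "xmlcharrefreplace")


def to_utf8(raw: bytes, source: str = "iso-8859-1") -> bytes:
    """Re-encode the bytes so that a charset=utf-8 declaration is true."""
    return raw.decode(source).encode("utf-8")


sample = b"H\xf6hle, H\xfctte"              # bytes as stored in the file
print(decodes_as(sample, "iso-8859-1"))    # True
print(decodes_as(sample, "utf-8"))         # False
print(to_ascii_entities(sample))           # b'H&#246;hle, H&#252;tte'
print(to_utf8(sample))                     # b'H\xc3\xb6hle, H\xc3\xbctte'
```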

I use katast.htm for comparison as it's _extremely_ similar.

I noticed the x bit set on geology.jpg, so I fixed up all the spurious
x bits instead, and will leave this mystery for later.

Wookey
-- 
Principal hats:  Linaro, Debian, Wookware, ARM
http://wookware.org/




