[Expo-tech] files to munge whilst converting VCS

Wookey wookey at wookware.org
Tue Apr 14 04:38:32 BST 2020


On 2020-04-13 11:13 +0100, Mark Shinwell wrote:
> On Mon, 13 Apr 2020 at 00:54, Wookey <wookey at wookware.org> wrote:
> 
>     I'm also resizing overly large images (few need to be bigger than
>     1024x768, and we have stuff up to 6000x4000) and PDFs which have much
>     higher than printer resolution (usually due to lots of included large
>     images).
> 
> 
> This bit seems like a mistake to me, for the following reasons:
> 
> - We are losing information that we might, one day, want to use (perhaps in a
> context we hadn't thought about yet).  To me this is the overriding reason not
> to do this.  I also know how keen you are on preserving data so I'm maybe
> missing something here.

I'm not throwing away anything we haven't got elsewhere. See my reply to Phil for details.

> - The resolution of 1024x768 (even more so for 640x480) hasn't been relevant
> for a very long time now :)  Reducing images to this size makes them appear
> frustratingly small on modern high-resolution displays (for example on the
> machine I'm typing this email on, an image of that size would be barely 10cm
> across on a 16" screen).  It also makes any image, that might have been
> suitable for printing, now unsuitable for printing.

You are confusing display size with native resolution, which are no
longer fixed to be the same thing. Some size around that is plenty to
see detail of what the photo was about on the website: cave location,
entrance, people, views etc. And we have the originals to re-do if one
day in the future someone decides higher native web-resolution is
appropriate.

There is no point having a few recent pics at massive resolution when
the vast bulk of entrance pics are 800x600 or so. Keeping images under
1MB seems about right for website images.
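
As a rough illustration of the "shrink anything bigger than 1024x768" rule, here is a small sketch that works out the target size for an image while keeping its aspect ratio and never enlarging. The function name fit_within and the default box are made up for this example; it is not the script actually used for the conversion.

```shell
#!/bin/bash
# Hypothetical helper: print the WxH an image should be scaled to so that it
# fits inside a bounding box (default 1024x768), keeping the aspect ratio.
# Images already inside the box are left at their original size.
fit_within() {
    local w=$1 h=$2 maxw=${3:-1024} maxh=${4:-768}
    if (( w <= maxw && h <= maxh )); then
        echo "${w}x${h}"
        return
    fi
    # Compare w/maxw with h/maxh by cross-multiplying, to stay in integers.
    if (( w * maxh >= h * maxw )); then
        # Width is the limiting dimension.
        echo "${maxw}x$(( h * maxw / w ))"
    else
        # Height is the limiting dimension.
        echo "$(( w * maxh / h ))x${maxh}"
    fi
}

fit_within 6000 4000   # one of the oversized images
fit_within 800 600     # a typical entrance pic, left alone
```

If ImageMagick happens to be the tool in use (an assumption, the thread doesn't say), its geometry syntax '1024x768>' applies the same "only shrink larger images" rule directly.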
 
> - I would be concerned about rewriting PDF files simply because it's prone to
> going wrong (inherently difficult, writers/readers being non-compliant with the
> standard, etc).  Are you sure they haven't been broken?

Yes. I looked. And compared image fidelity to check I wasn't
over-compressing. It turns out this is only being done to 2 files in
the end where it was actually beneficial. years/2009/ExpoReportCUCC.pdf goes
from 10MB to 1MB.

> As a more general point, all this work really is appreciated Wookey, but I do
> wonder if you're making life harder for yourself than needs be :)  I would have
> thought a no-frills conversion in this case would have been sufficient in the
> first instance and much less labour intensive.

It would, but I've done it now (well almost - one bug to fix - I can
still revert to the plain conversion if it proves too tiresome). It
really is worth throwing out things like the deleted 83MB file at the very least.
 
> One other thing that comes to mind is that with git it is possible to make
> clones without cloning all the history, which might be useful in certain
> circumstances.  In particular this reduces further any need to be concerned
> about repository size, I think.

Yep. Could be useful. I should experiment with the effects of
--depth. Could be good if you really do just want 'current' and we
should perhaps recommend it to typical expo users - not sure how
useful history is to the average user? Does that 'depth' set the number
of commits to each file brought in?
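
For anyone wanting to try this, a minimal sketch of such an experiment follows. As far as the git documentation goes, --depth truncates history to the given number of commits back from the branch tip, rather than counting commits per file. The scratch repository, file name and committer identity below are invented purely for the demonstration.

```shell
#!/bin/bash
# Sketch: build a throwaway repository with three commits, then shallow-clone
# it and compare how much history each copy holds.
work=$(mktemp -d)
git init -q "$work/full"
for n in 1 2 3; do
    echo "revision $n" > "$work/full/notes.txt"
    git -C "$work/full" add notes.txt
    # Identity passed on the command line so no global git config is needed.
    git -C "$work/full" -c user.name=demo -c user.email=demo@example.invalid \
        commit -q -m "commit $n"
done

# Use a file:// URL: git ignores --depth when cloning from a plain local path.
git clone -q --depth 1 "file://$work/full" "$work/shallow"

full_count=$(git -C "$work/full" rev-list --count HEAD)
shallow_count=$(git -C "$work/shallow" rev-list --count HEAD)
echo "full: $full_count commits, shallow: $shallow_count commit(s)"
```

The shallow copy still has the complete current tree, so it should suit users who only want 'current'; more history can be pulled in later with git fetch --deepen or --unshallow.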

> Have you started converting loser?  If not I'm happy to try doing that
> including the CVS changesets.

I have done a vanilla hg-fast-export, but that doesn't do either of
the tricky bits.

1) putting the CVS changesets on the front of the history.

2) putting in the 2015/2016 ARGE changes in the right place in history
so that it is still possible to check out a 'post-2015' and
'post-2016' dataset which actually contains the cave-as-surveyed
then. (We didn't get that data until 2017)

Again we only get one chance to fix this, so now seems like the
time. I can dig out the patches for you.

So yeah, feel free to have a go, as I've not worked out how to do the CVS->git part yet.

You want a script something like this:

#!/bin/bash

gitrepo=git/loser
hgrepo=hg/loser
tmpfsfile=/dev/shm/migrate/scratch
startdir=~wookey/docs/caving/expo

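#start in the right place:
cd "$startdir" || exit 1
#echo "removing previous goes"
#-f so a missing directory from an earlier run is not an error
rm -rf git/migration "$gitrepo"

#ensure hg repo is uptodate
(cd "$hgrepo" || exit 1; hg pull; hg update; hg commit -y)
#give up if the target directory cannot be created or entered
mkdir -p "$gitrepo" || exit 1
cd "$gitrepo" || exit 1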
git init
echo -n "importing..."

#something in here to import CVS stuff...

#hg-fast-export in bionic has to be modified to work on mercurial 4.5:
#https://github.com/samgiagtzoglou/fast-export/commit/68e5b1d072a282f92c78134753ac4d1c50ac6cdc
hg-fast-export -r "../../$hgrepo" -A authors
echo "done"
git checkout loser
echo "repo original size: $(du -sh)"
git commit -a -m "fixup files with execute bits set"

#tidy up cruft
git for-each-ref --format="%(refname)" refs/original/ | xargs -r -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc --prune=now --aggressive

Authors file attached (normalises hg commit names to git ones)
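
Since a single bad line in the map can leave a commit name unmapped, it may be worth sanity-checking the file before a run. The following is only a sketch: check_authors is a made-up name, and it assumes nothing beyond the "old"="new" line form visible in the attachment.

```shell
#!/bin/bash
# Hypothetical checker for an author map in the "old"="new" form.
# Prints malformed lines and repeated left-hand names, and returns
# non-zero if it finds either.
check_authors() {
    awk '
        BEGIN { bad = 0 }
        /^[[:space:]]*$/ { next }            # ignore blank lines
        !/^"[^"]+"="[^"]+"$/ {
            print "malformed line " NR ": " $0
            bad = 1
            next
        }
        {
            key = $0
            sub(/"=".*$/, "", key)           # drop the separator and new name
            sub(/^"/, "", key)               # drop the opening quote
            if (key in seen) {
                print "duplicate name on line " NR ": " key
                bad = 1
            }
            seen[key] = NR
        }
        END { exit bad }
    ' "$1"
}
```

Usage would be something like: check_authors authors && echo "authors map looks sane"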


Wookey
-- 
Principal hats:  Linaro, Debian, Wookware, ARM
http://wookware.org/
-------------- next part --------------
"aaron"="Aaron Curtis <aaron.curtis at cantab.net>"
"AaronCurtis"="Aaron Curtis <aaron.curtis at cantab.net>"
"Aaron Curtis <aaron.curtis at cantab.net>"="Aaron Curtis <aaron.curtis at cantab.net>"
"aarongc at jacob"="Aaron Curtis <aaron.curtis at cantab.net>"
"aaron at localhost"="Aaron Curtis <aaron.curtis at cantab.net>"
"aaronOnJacob"="Aaron Curtis <aaron.curtis at cantab.net>"
"aaron at seagrass.goatchurch.org.uk"="Aaron Curtis <aaron.curtis at cantab.net>"
"anthon"="Anthony Day <ajday1973 at gmail.com>"
"anthony"="Anthony Day <ajday1973 at gmail.com>"
"Anthony Day"="Anthony Day <ajday1973 at gmail.com>"
"BeckaLawson"="Rebecca Lawson <beckalawson at gmail.com>"
"becka"="Rebecca Lawson <beckalawson at gmail.com>"
"Becka"="Rebecca Lawson <beckalawson at gmail.com>"
"Chris Densham"="Chris Densham <chris.densham at stfc.ac.uk>"
"dave"="David Loeffler <d.loeffler.01 at cantab.net>"
"dl267"="David Loeffler <d.loeffler.01 at cantab.net>"
"Duncan Collis"="Duncan Collis <duncan.collis at gmail.com>"
"DWalker <dw444 at cam.ac.uk>"="David Walker <dw444 at cam.ac.uk>"
"emma"="Emma Wilson <emma.d.wilson at gmail.com>"
"expo computer"="expo computer"
"expo at expobox.potato.hut"="expo computer"
"expo<expo at expobox.potato.hut>"="expo computer"
"expo"="expo user"
"expo laptop"="expo computer"
"expolaptop"="expo computer"
"expoonserver"="expo server"
"ExpoOnServer"="expo server"
"expo at seagrass.goatchurch.org.uk"="expo server"
"expouser"="expo user"
"expouser <wookey at wookware.com>"="expo computer"
"Frank Tully"="Frank Tully <franktully1 at gmail.com>"
"george"="George Breley <george.breley at hotmail.com>"
"goatchurch at goatchurch-PC"="Julian Todd <julian at goatchurch.org.uk>"
"goatchurch"="Julian Todd <julian at goatchurch.org.uk>"
"goatchurch at ubuntu.clocksoft.dom"="Julian Todd <julian at goatchurch.org.uk>"
"goatchurh"="Julian Todd <julian at goatchurch.org.uk>"
"gopatchurch"="Julian Todd <julian at goatchurch.org.uk>"
"jacobpodesta29 at gmail.com"="jacob Podesta <jacobpodesta29 at gmail.com>"
"Jenny"="Jenny Black <jenelopy at gmail.com>"
"julian"="Julian Todd <julian at goatchurch.org.uk>"
"Luke Stangroom"="Luke Stangroom <lukestangroom at gmail.com>"
"Lydia Leatheri"="Lydia Leather"
"mark"="Mark Shinwell <mshinwell at gmail.com>"
"Martin Green"="Martin Green <martin.speleo at gmail.com>"
"MartinGreen"="Martin Green <martin.speleo at gmail.com>"
"Martin"="Martin Green <martin.speleo at gmail.com>"
"martin.speleo at gmail.com"="Martin Green <martin.speleo at gmail.com>"
"martinspeleo"="Martin Green <martin.speleo at gmail.com>"
"Michael at Laptop"="Michael Sargent <mjs244 at cam.ac.uk>"
"michael-laptop"="Michael Sargent <mjs244 at cam.ac.uk>"
"Michael"="Michael Sargent <mjs244 at cam.ac.uk>"
"mjg54"="Martin Green <martin.speleo at gmail.com>"
"mshinwell"="Mark Shinwell"
"Nadia"="Nadia Raeburn-Cherradi"
"noel"="Noel Snape <noelsnape at googlemail.com>"
"olaf"="Olaf Kähler <caving at island-olaf.de>"
"olly"="Olly Betts <olly at survex.com>"
"olly at survex.com"="Olly Betts <olly at survex.com>"
"pjrharley"="Pete Harley"
"rlawson at liv.ac.uk"="Rebecca Lawson <beckalawson at gmail.com>"
"Sam <sam at wenhams.co.uk>"="Sam Wenham <sam at wenhams.co.uk>"
"sb476"="Stuart Bennett <sb476 at cam.ac.uk>"
"serChris Densham"="Chris Densham <chris.densham at stfc.ac.uk>"
"ser=Martin Green"="Martin Green <martin.speleo at gmail.com>"
"serMartin Green"="Martin Green <martin.speleo at gmail.com>"
"Stangroom"="Luke Stangroom <lukestangroom at gmail.com>"
"substantialnoninfringinguser"="Aaron Curtis <aaron.curtis at cantab.net>"
"substantialnoninfringinguser at gmail.com"="Aaron Curtis <aaron.curtis at cantab.net>"
"Thomas Starnes"="Thomas Starnes <thomstarnes at gmail.com>"
"Todd"="Julian Todd <julian at goatchurch.org.uk>"
"wookey"="Wookey <wookey at wookware.org>"
"wookey <wookey at wookware.org>"="Wookey <wookey at wookware.org>"
"Wookey<wookey at wookware.org>"="Wookey <wookey at wookware.org>"