[Expo-tech] loser git conversion

Wookey wookey at wookware.org
Wed May 6 01:53:35 BST 2020


On 2020-05-03 17:28 +0100, Mark Shinwell wrote:

> In fact if you want to look at something else, there are some problems later
> on, where some ARGE data was brought in containing multiple years' surveys in
> the same hg commits. I fear this may be quite extensive. 

OK, so I looked through all the tags to see how many 'wrong year' files
were added in each year. Details are below. Then I started to compare
with the hg datasets to see how different it was from before.

Marking the post-<year> tags in hg to match the git ones reveals one
very odd thing: post-2012 is the commit _after_ post-2013 in hg. (on
different branches). So it looks like both years were sorted at the
same time on parallel branches then merged.

Then I realised that the new git repo has a completely linear history,
so we've lost this sort of info about what was done on branches and then
merged. Was that done on purpose or is it a feature of the conversion?
Wouldn't it be better to just keep the rather messy structure as-is?
And only change things that we have a reason to change?

I can see that that linearisation would probably cause the conflicts you saw.

It's OK to linearise so long as we end up effectively rebasing the
data into a historical order that makes sense, but it's probably quite
easy to mess up here.  We can't preserve both 'order things actually
happened in the dataset' and 'order things happened on the
ground'. e.g. when someone finds a survey from a few years ago and
puts it in. I think it makes sense to go with 'happened on the ground'
in general, but if say a 2012 new cave is added to the website and
dataset in 2015, if we move the loser data to its 'correct' 2012 spot,
now when someone sees the 2015 website checkin, and looks in 2015 for
corresponding dataset changes they won't find it, and will be
surprised to find the data's been there for 3 years. Bit
confusing. The point being that when rummaging through our history to
work out what happened this information isn't totally valueless.

Anyway back to the status of 'wrong year files'.  I've not yet
finished the process of comparing with hg but this issue of 'are we
really removing all the branching info?' seems to me to be important
so posting now and I'll send another mail when I have more info.

So every year has lots of edits of old files - usually just adding
exports, but also adding LRUDs and correcting cock-ups. Fortunately we
are just looking for new files with dates in the wrong year. This seems
to work quite well:
git diff --name-status post-2014 post-2015 | grep ^A | cut -f 2 | xargs grep date | grep -v 2015
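
One caveat with the one-liner above: `grep -v 2015` drops any line that
mentions 2015 anywhere, including in the path (a cave directory named after
that year, say), and the final greps read the working tree, so the checkout
needs to correspond to the later tag. A slightly stricter sketch follows,
assuming dates of the form `*date YYYY.MM.DD` as in the wiggly.svx line
quoted further down. The helper name `wrong_year` is made up for
illustration; it compares only the year in the date field:

```shell
# wrong_year YEAR: read "path:line" records (as printed by grep -H) on stdin
# and print those whose *date year differs from YEAR.
# Assumption: dates look like "*date YYYY.MM.DD". wrong_year is a
# hypothetical helper, not part of any existing tooling.
wrong_year() {
    awk -v year="$1" '
        {
            # Find the "*date YYYY" field itself, ignoring the path.
            if (match($0, /\*date[ \t]+[0-9][0-9][0-9][0-9]/)) {
                found = substr($0, RSTART + RLENGTH - 4, 4)
                if (found != year) print
            }
        }'
}

# Intended use, mirroring the one-liner above:
# git diff --name-status post-2014 post-2015 | grep ^A | cut -f 2 \
#     | xargs grep -H date | wrong_year 2015
```

Lines without a recognisable `*date` are silently skipped, so undated files
would still need checking by hand.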

Going through the lot we find that a few years are kosher, but most
add some old surveys. Biggest cases are usually ARGE catch-up dumps,
and those are the ones that mess with our SMKsystem lengths.

There are also quite a lot of years with surveys from the future added
in them, which really doesn't seem right. Possibly merging in newer
version of files than should be there, as they are usually in arge
'several surveys in one file' files.

Quite a few are someone getting round to adding an older survey for a
minor cave. It would be nice if they were in the correct year, but
it's not all that important on the scale of things.

post-2004 adds:
2013! two connecting-leg surveys! in caves/107/107.svx
2002 caves/2002-07/2002-07.svx
2002 caves/2002-08/2002-08.svx
2002 caves/2002-w-02/2002-w-02.svx
2000 caves/220/220.svx
2002 caves/234/trunk.svx
2002 caves/238/238.svx
2002 caves/quarriesd/quarriesd.svx

post-2005 adds:
1990 caves/225/225.svx
2004 13 files in caves/204/subsoil/*

post-2006 is all good.

post-2007 huge list of stuff.
in 40, 41, 78, 87, 115, 143, 144, 145, 152.
mostly very old (70s, 80s; files moved about?)
2010 and 2011 files in 143
2012 143/rampe.svx
2010+2011 143/canyon.svx
Some of that really looks like it should be sorted out. Files from 5 years in
the future have got to be a mistake.

post-2008 all good

post-2009 has two 258 surveys from previous year
2008 caves/258/stonemonkey/stonemonkey3.svx
2008 caves/258/stonemonkey/stonemonkey4.svx

post-2010 is all good

post-2011
one new cave from previous year:
2010 caves/2010-07/2010-07.svx
21 caves/143 surveys from 2006 to 2012 (in 7 files)
presumably an ARGE import

post-2012
more ARGE imports (115, 142, 143, 32, 40)
2009-2015.
some 1980s cucc stuff in 115 (moved I suspect)
A few new caves from 2005, 2006, 2011
2005 caves/233/nachbarschacht.svx:
2006 caves/2006-73/2006-73.svx:
2011 caves/258/suicidalvampyre.svx:

Again the future files really are suspicious.

post-2013
2003 caves/247/247.svx
2007 caves/260/260.svx
2014 and 2016 caves/32/forever.svx
107 files from 2012 and 2015

More future-files?


post-2013 to 2014 diff adds:
caves/107/wiggly.svx:*date 2013.08.14 (4 legs of grade1 in 107)
minor, but I moved it anyway.
and
2015 caves/40/arge/silbercanyon.svx

Looks like it was a classic 'included but survex file not committed'
case, so there are commits for adding it, commenting it out because
it's breaking the dataset, checking it in and adding the include back
in. Original include of course mixed with other stuff. Fiddly to
untangle (and bin the now-moot-and-empty comment/uncomment commits).

post-2014 to 2015 adds some late surveys
2014 caves/264/noserock.svx
2014 caves/264/digdug.svx
2013 caves/107/oldriftconn.svx
2012 caves/2015-mf-06
2006 caves/B4/B4.svx
2002 caves/1626/5.svx 
1992 caves/BS17/organfake.svx

Adding new caves with old data from other people to the dataset is
fair enough - that's history.
Getting round to entering data from our own separate minor caves (B4, 2015-mf-06)
is slack, but better late than never, and I'm not sure it's worth moving them.
The 107 survey is only 2 legs, but noserock is a proper survey.

Reason for breakage is digdug was included twice
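
A double include like that can be spotted mechanically. This is a rough
sketch only, assuming one `*include <name>` per line and that the duplicate
is spelled the same way both times. The helper name `dup_includes` is
hypothetical:

```shell
# dup_includes FILE...: print any *include target that appears more than
# once across the given survex files.
# Assumptions: one "*include <name>" per line; commented-out lines
# (starting with ";") are ignored because their first token is not
# "*include". dup_includes is a made-up name for illustration.
dup_includes() {
    awk '
        # Survex commands are case-insensitive, so compare in lower case.
        tolower($1) == "*include" { print $2 }
    ' "$@" | sort | uniq -d
}

# Example: dup_includes caves/264/264.svx
```

It won't catch the same file pulled in under two different spellings
(with and without .svx, or via different relative paths).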

post-2016
2012 caves/2012-js-1/2012-js-1.svx
2007,2015 caves/271/271.svx
15 2014 and 2015 surveys in 158, from ARGE

post-2017 is all good

2017-2018 is too hard to check because everything moved so everything is 'new'.

post-2019 seems all kosher.




Wookey
-- 
Principal hats:  Linaro, Debian, Wookware, ARM
http://wookware.org/