[Expo-tech] loser git conversion

Sun May 3 18:27:14 BST 2020

On 2020-05-03 17:28 +0100, Mark Shinwell wrote:
> I'll do some more tonight, so hang on!  I'll probably concentrate on looking at
> the "gap" first.

OK. Have fun :-)

> My initial impression of the gap was that it looked like a big import of a
> working copy.  There were a lot of temporary files, editor backup files, etc.
> brought in at that point (which I removed in the git import I thought, but your
> email makes me think I missed some -- will look).  I wonder if this might have
> been after a "merge" with the ARGE data somehow?  Or maybe a big commit from
> the on-expo machine or similar?

Not sure exactly. The main thing is that there was a big dollop of
changes that came from ARGE. It may have been the first time we tried
to properly merge our datasets. Made more confusing by getting rid of
the leading zeros on two-digit caves (This is odd - I was pretty sure
we never had leading zeros - only ARGE did, but the CVS says
otherwise). Anyway this did actually happen, so it's fine to be in the
history.

It's quite possible that we had changed all the dates to ISO and added
*titles (and maybe didn't have lots of trailing whitespace), then ARGE
moved the files round and that all got reverted. If so that's fine to
have in the history. And maybe some of this stuff never got tidied
again (the dates at least must have been) so it should be saved up to
the end of history for improving the dataset.

> Probably easiest for you to do things on separate branches for the moment, and
> I will update the master accordingly.

OK

> In fact if you want to look at something else, there are some problems later
> on, where some ARGE data was brought in containing multiple years' surveys in
> the same hg commits.  I fear this may be quite extensive.  Unfortunately I've
> forgotten the revision that I saw this at now, but I think it was some 2014
> data present in a later commit including other (later) updates too.

OK. I'm reasonably familiar with that, so will take a look.

> Disentangling (or even maintaining this going forward) all seems like a lot of
> work and I'm starting to wonder if it's worth it.  To keep the "yearly
> checkpoint" scheme working, we need to make sure that changesets don't touch
> multiple years at once, and that any subsequent change to a previous year gets
> rebased to a point between the appropriate two yearly tags.  I fear this is
> going to cause a lot of rebasing and changing of revision numbers which may
> cause a nuisance and potentially trouble -- it will also likely complicate any
> proposals for non-nerd-friendly workflows.  What tangible benefits do we
> actually get from having this property?

The main benefit is that we can get accurate lengths for each
year. People are very interested in cave lengths and want them every
year, so being able to provide ones which do actually reflect 'survey
reality' is a good thing.

This can also be used to generate things like animations of
discoveries over time.

It's not the end of the world if it's broken back in these early
revision years, but the data must have worked at the time so we should
make it so that at least at that one point it processes OK. Having
come this far and gone to the effort of bringing the CVS history back
in, I think it's worth trying to finish this off. I think it's really
quite important to have each year both processable and historically
accurate for recent years, so whilst we could punt on 2003 if it
really is too much of a mess, I think we should get the other years
working, otherwise it's a regression over the hg repo.

As for future rules, we can discuss whether to allow rebases on this
data after this conversion is made public (I was assuming not, as
users need to know when a repo may be rebased). So yes there is a risk
that some data will be checked in (or connected) in the 'wrong'
year. Sometimes that's what actually happens (e.g. homecoming (?) is
missing its entrance survey after 2019, so its 'survey length' is 0
even though it does have one and we have quite a lot of surveys for
it). There is of course a distinction between checking data in in the
right year and connecting it so it processes.
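
A rough way to spot changesets that touch more than one year could be
sketched like this. It is only a sketch: it assumes the survey year
appears as a four-digit number somewhere in each file path, which may
not match how the dataset is actually laid out, and the function names
and example paths are made up for illustration. The mapping from
revision to changed paths would have to be produced from the repo by
whatever means is convenient.

```python
import re
from typing import Dict, Iterable, List, Set

# Assumption: a survey year is a standalone 4-digit number starting
# with 19 or 20 somewhere in the path. Cave numbers such as "204" are
# ignored, but a cave numbered e.g. "2012" would be a false positive.
YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def years_in_path(path: str) -> Set[int]:
    """Return every year-like number found in a single path."""
    return {int(m.group(0)) for m in YEAR_RE.finditer(path)}


def years_touched(paths: Iterable[str]) -> Set[int]:
    """Return the union of years found across a changeset's paths."""
    years: Set[int] = set()
    for path in paths:
        years |= years_in_path(path)
    return years


def multi_year_commits(commits: Dict[str, List[str]]) -> Dict[str, List[int]]:
    """Map each revision touching more than one year to those years."""
    flagged: Dict[str, List[int]] = {}
    for rev, paths in commits.items():
        years = years_touched(paths)
        if len(years) > 1:
            flagged[rev] = sorted(years)
    return flagged


if __name__ == "__main__":
    # Hypothetical revisions and paths, purely for illustration.
    example = {
        "rev-a": ["surveys/2014/entrance.svx"],
        "rev-b": ["surveys/2014/deep.svx", "surveys/2015/new.svx"],
    }
    print(multi_year_commits(example))
```

This only flags candidates for a human to look at. It says nothing
about whether the data was checked in in the right year, or whether it
is connected so that it processes.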

But let's put future use of the dataset aside for now, and just get it
into a historically-valid state which is at least no worse than the
existing hg loser repo.

Wookey
-- 
Principal hats:  Linaro, Debian, Wookware, ARM
http://wookware.org/