head	1.3;
access;
symbols;
locks; strict;
comment	@# @;


1.3
date	2014.03.06.05.24.08;	author dholland;	state Exp;
branches;
next	1.2;
commitid	90Sbf6gnlCCG9Brx;

1.2
date	2013.05.24.08.25.11;	author wiz;	state Exp;
branches;
next	1.1;
commitid	uO830S8w88aQlRQw;

1.1
date	2013.05.24.00.41.31;	author dholland;	state Exp;
branches;
next	;
commitid	1ggxzJuBByZCMOQw;


desc
@@


1.3
log
@a couple minor adjustments, sitting around since last july
@
text
@The material herein is grouped first by topic and then by priority.

------------------------------------------------------------

1. Operational model

- Centralized operation with one master tree
- Supports disconnected operation
- No compare-by-hash
- Native support for synced slave copies of the master tree (like anoncvs)
- Transport-independent remote operation, supporting both http/https
  and ssh
- Checkouts can cache arbitrary amounts of history locally but are not
obliged to clone everything
- Non-committers with readonly checkouts should be able to package
changesets for review and commit by committers.

Rationale:

Centralized operation is posed as a design requirement because it's a
prerequisite for other things... and because this whole project is
predicated on the assumption that centralized operation is acceptable.
If someone comes up with a clever way to support distributed operation
without compromising other requirements, well and good; otherwise one
may as well use one of the modern distributed version control systems.

Disconnected operation, meanwhile, covers most of the use cases people
cite in favor of distributed version control.

Compare-by-hash is bad not because it's slightly sleazy, or because
the statistical assumptions about the probability of collisions are
wrong (although in some contexts they're questionable) -- it's because
cryptographic hash functions don't age well and the standard DVCS
scheme for hashing chains of versions doesn't provide any decent way
to migrate an existing repository to a new hash function.

Native support for synced slave copies is needed in order to be able
to provide anonymous access (like anoncvs) without needing access to
the master tree. This is also meant to satisfy the use cases where
currently people rsync the whole CVS repository locally.

Transport-independent remote operation should be a no-brainer, but
even many recent systems have felt the need to make up their own
protocols and network-level constructs.

Easier collaboration with non-committers is an often-requested
feature and a de facto property of distributed version control.

Unanswered questions:

- Do we need support for disconnected operation by more than one user
at a time (or perhaps more than one tree at a time) so that
uncommitted changesets can be shared? The non-committer changeset
support might cover this territory adequately, or not, depending on
how it ends up working.


2. Schema

- Supports arbitrary (smallish) metadata attached to changesets, and
also to files and directories
- Metadata (including on old versions) is mutable and changes are kept
in history (this includes commit message text)
- Provides provenance tracking for changesets/commits
- Commits/changesets are atomic
- Version numbers (for projects, files, and subtrees if any) are
sequential.
- Supports rename (of files or dirs) properly, and file history
crosses renames transparently
- Supports copy/duplicate (of files or dirs) properly
- Has a coherent semantic model of tree history
- Supports local-only changes that are not pushed back to the master
tree

Rationale:

While arbitrary metadata is a nuisance to support (compared to a small
fixed metadata schema) and in many cases using this metadata facility
(as opposed to storing information in an ordinary file in the
repository) would be a mistake, it is nonetheless useful for various
purposes. One of these is preserving old version numbers from a
repository conversion; given the large number of references to NetBSD
CVS file version numbers, including in places like security advisories
that count as "important", preserving this information and making it
searchable is highly desirable.

Metadata should be mutable because sometimes it contains errors. One
of the big weaknesses of current distributed version control is that
effectively all metadata is immutable once committed; this means any
botch not immediately detected is graven in stone for all time, unless
someone does a complete repository rebuild updating all subsequent
versions. Meanwhile, keeping the history should be a no-brainer.

Provenance tracking for commits is important for two reasons:
maintaining proper credit/attribution (which can involve legalities
via copyright as well as propriety) and also making sure that bogus
changesets cannot be introduced. In distributed version control
systems this becomes complicated and either requires an elaborate
solution (e.g. in monotone) or giving up on the problem entirely (e.g.
in mercurial). For a centralized system it is much easier but still
important, especially given tools for applying changesets that
originate from non-developers.

Changesets need to be atomic. Non-atomic changesets are a stupid design
flaw of CVS that we should certainly not perpetuate.

Version numbers need to be sequential so it's possible to tell easily
if a version you have contains a particular change or fix... and in
particular, tell easily without having to cut and paste a hash code
and go ask the version control system. You should be able to tell at a
glance from running ident on a binary whether it needs to be replaced
with a fixed version or not.

CVS doesn't support rename. We desperately need rename support because
large sections of the NetBSD source tree are in serious need of
organizational cleanup. File history should cross renames because if
you're looking at the history of a particular file, you shouldn't have
to stop and go search something else just because someone moved the
file around. This sounds like a no-brainer but a lot of "modern"
version control systems don't really get it right. Note that rename is
not semantically equivalent to copy and delete.

Likewise, duplicating a file is an action that should be explicitly
recorded; the support for sideways change propagation (below) requires
this.

A coherent semantic model of tree history is required in order to do
merges of changesets that reorganize the tree. Many "modern" version
control systems don't really get this right.

Support for local-only changes is highly desirable if you're carrying
local modifications; it's effectively the same as keeping private
changes as uncommitted modifications in your working tree, except with
more structure, proper history, and a way to explicitly make sure the
changes don't get committed by accident.


3. Branches and branch management

- Supports lightweight branches / multiple heads
- Supports full/named branches
- Supports something like hg bookmarks to keep git users happy
- Distinguishes branches intended to diverge from those intended to be
folded back in later
- Allows enforcing a graph of branch relationships
- Keeps track of which changesets from parallel branches have been
pulled in/merged across (including instances of separate but
equivalent changes)
- Also keeps track of which changesets from parallel branches have
been considered and rejected
- Supports this same form of sideways change propagation for files
that have been duplicated
- Supports hyper-branches (preferably)
- Supports local-only branches that are not pushed back to the master
tree
- Allows accumulating small local changes into a single upstream
commit that neither loses the individual change history nor forces
other users to wade through it except by choice
- Maybe, support for local patch queues

Rationale:

Lightweight branches (that is, if you commit a change based on an
older version you just get another head) are necessary for
disconnected operation. These occur and get merged on short timescales
as a routine matter during development.

"Real" branches (branches with names that have metadata and tracking
information and so on) are also required, for releases and for
development of major features and so forth.

Mercurial was forced to add "bookmarks" to keep git users happy; a lot
of git users apparently don't understand anything besides git's insane
branch semantics and aren't interested in learning or understanding
what they're doing. We will need something like this too, in all
probability (and it's a useful feature) so it may as well get designed
in up front.

Branches that are intended to diverge (releases, for example, or
outright project forks) are fundamentally different from branches that
are expected to reconnect to their parent (e.g. feature development
branches) once the version control system has any kind of branch
management or tracking support.

If you have a lot of branches that are supposed to exist with certain
relationships to one another, it's fairly easy to accidentally break
this structure by merging with the wrong other branch; and if you do,
backing out of the resulting mess can be quite a nuisance. Therefore,
it should be possible to declare the intended structure and have the
system reject accidental attempts to violate that structure. (Note:
NetBSD may not need this. dholland specifically wants it and will put
in the work to get it.)

No existing version control system keeps track of which changesets
from branch A have and have not been pulled in to branch B, or is
capable of listing the ones that haven't been considered yet for
possible action. There is absolutely no reason, however, that the
version control system shouldn't be able to provide this information.
AIUI, for release branches releng currently has to maintain this
metadata by hand. (Update: "no existing ..." may actually be "no
existing free ...".)
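
As a rough illustration of how little bookkeeping this involves, here
is a minimal Python sketch. Everything in it (the MergeLedger class,
the status names, the branch names and changeset numbers) is
hypothetical and made up for this note; it is not a design, just a
demonstration that "what hasn't been considered yet" is an easy
question to answer once the decisions are recorded.

```python
# Hypothetical sketch only: none of these names come from an existing tool.

MERGED = "merged"          # pulled across as-is
EQUIVALENT = "equivalent"  # a separate but equivalent change exists
REJECTED = "rejected"      # considered and deliberately not pulled


class MergeLedger:
    def __init__(self):
        # (source branch, target branch, changeset number) -> status
        self.status = {}

    def record(self, source, target, changeset, status):
        self.status[(source, target, changeset)] = status

    def pending(self, source, target, changesets):
        """Changesets from source not yet considered for target, in order."""
        return [c for c in sorted(changesets)
                if (source, target, c) not in self.status]


ledger = MergeLedger()
trunk_changes = [101, 102, 103, 104]
ledger.record("trunk", "release-1", 101, MERGED)
ledger.record("trunk", "release-1", 103, REJECTED)
print(ledger.pending("trunk", "release-1", trunk_changes))  # prints [102, 104]
```

Recording rejections alongside merges is what keeps a rejected
changeset from showing up again every time someone asks for the
pending list.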

If you duplicate a file, such as cloning a device driver template file
for a new driver, or starting a new pmap by copying an old one,
usually bug fixes applied to the original version should also be
propagated to the clone. The same kind of changeset tracking just
described for branches should be available for duplicated files, to
make sure this gets done and to allow easily keeping track of where it
has and hasn't been done.

By "hyper-branches", I mean a branch of the entire repository state,
including branches. (I have a vague recollection that somebody else
may be using the term "hyper-branches" for something else, in which
case we need new terminology.) This is, for example, something you
might want if you have two parallel versions of a project (e.g. a free
and pay version) and maintain those as branches, but then also want to
be able to take release branches of both at once. I have no idea at
the moment if there's a use case for hyper-hyper-branches (that is,
branches of hyper-branches) or not. (Note: NetBSD does not need this.
dholland specifically wants it for another project and is willing to
put in a good deal of work to get it.)

Local-only branches have the same rationale as local-only changesets.

Merging cumulative local commits into a single upstream commit makes
it possible to commit very early and very often (which is very useful
if you ever need to bisect later) without deluging other developers on
the project with a flood of tiny commits they don't care about in
detail. However, because you want to maintain the individual changes
in the master repository (to support that bisecting) but don't want to
show them by default, there needs to be explicit support for dividing
changesets into subchangesets and an explicit way to expand them when
viewing history. No existing system can do this; many can do something
similar, but in all cases I know of this either throws away the
fine-grained history or makes everybody wade through it afterwards.
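
A minimal sketch of what "subchangesets" could look like, in Python
for brevity. The Changeset class, the log_lines function, and the
"57.1"-style numbering of the parts are all assumptions invented for
this example; the point is only that the default history view and the
expanded view can come from the same stored data.

```python
# Hypothetical sketch: a changeset that may contain subchangesets.

class Changeset:
    def __init__(self, number, message, parts=()):
        self.number = number
        self.message = message
        self.parts = list(parts)   # fine-grained local commits, if any


def log_lines(changesets, expand=False, depth=0):
    """Return history lines; subchangesets appear only when expand is set."""
    lines = []
    for cs in changesets:
        lines.append("%s%s: %s" % ("  " * depth, cs.number, cs.message))
        if expand:
            lines.extend(log_lines(cs.parts, expand, depth + 1))
    return lines


history = [
    Changeset("56", "fix typo in man page"),
    Changeset("57", "rework buffer handling", parts=[
        Changeset("57.1", "add new buffer type"),
        Changeset("57.2", "convert callers"),
        Changeset("57.3", "remove old code"),
    ]),
]
print("\n".join(log_lines(history)))               # two lines
print("\n".join(log_lines(history, expand=True)))  # five lines
```

Bisecting would walk the expanded form; ordinary history browsing
would use the collapsed one.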

Local patch queues (like mq in mercurial) are a useful way of
maintaining private changes and/or preparing batch commits. It is
probable that most of the use cases are subsumed by other features
(local-only commits, cumulative commits, etc.) and we don't also need
patch queues. Given the branch graph structure feature described
above, even the use case of preparing patchkits for third-party trees
may be better done with branches, although it might be worthwhile to
arrange a way to do branch push/pop in a way akin to patch push/pop.


4. Implementation

- Written in C
- Doesn't depend on anything other than standard system libs
- Decently fast
- Scales to large trees with deep history
- Supports inotify/kqueue/whatnot for monitoring large checkout trees
- Install doesn't spew tons of crap all over everywhere
- Has an interface for plugins and/or extensions

Rationale:

Writing in C (or perhaps C++ but C++ is not really a sane choice of
language) with no major deps is a requirement for importing into base,
where the tool used to manage the NetBSD source tree should be found.

Being decently fast is necessary to avoid driving users crazy. Scaling
is necessary for use on/in NetBSD.

The major performance bottleneck for most systems on large trees is
scanning the tree for files that have been modified. This inevitably
takes as long as doing find . -ls, and on a tree the size of NetBSD's
source tree that takes a while even when the whole tree fits in RAM.
Many recent tools have a gizmo that starts a daemon using inotify or
similar to monitor the working tree in the background; then the
explicit search can be avoided and things become much faster.
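
To make the cost concrete, here is a minimal Python sketch of the
explicit scan (Python only for brevity; it says nothing about the
implementation language). The snapshot and modified functions are
hypothetical names, and comparing size plus mtime is an assumed
heuristic, not a claim about how any particular tool detects changes.
Note that it has to stat every file in the tree no matter how few have
changed; a background watcher would instead hand over the list of
touched paths directly.

```python
import os
import tempfile


def snapshot(root):
    """Record (size, mtime in ns) for every file under root."""
    state = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            state[os.path.relpath(path, root)] = (st.st_size, st.st_mtime_ns)
    return state


def modified(root, recorded):
    """Full scan: new or changed files. Deletions are ignored here."""
    current = snapshot(root)
    return sorted(p for p in current if recorded.get(p) != current[p])


# Small demonstration on a throwaway tree.
root = tempfile.mkdtemp()
for name in ("a.c", "b.c"):
    with open(os.path.join(root, name), "w") as f:
        f.write("int x;\n")
recorded = snapshot(root)
with open(os.path.join(root, "b.c"), "w") as f:
    f.write("int x, y;\n")
print(modified(root, recorded))
```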

A tidy install is desirable for a number of reasons (integration into
base being one of them) and should not be a major problem.

We want some kind of plugin/extension interface because, at a minimum,
there are probably some graphic tools that should be available and
they can't be part of the base install of either this program or
NetBSD.


5. User interface

- Clean, small command set
- No weird semantics
- Search support for metadata (including/also change messages) as well
as searching file contents

Rationale:

All of this is pretty much obvious. By "no weird semantics" I mean
anything from oddities like mercurial's tags to core design mistakes
like git's branches... or things in between like subversion's
branches; anything that violates the principle of least surprise or
that requires lengthy explanation/justification for why it doesn't
behave the way a reasonable person would expect.


6. Miscellaneous other features

- Can remove/obsolete/blacklist unwanted changesets
- Supports splicing of equivalent but technically unrelated versions
- Can stash local changes temporarily
- Can check out subtrees
- Can explicitly revert files or whole subtrees in a checked-out tree
to earlier versions
- Supports configurable keyword expansion

Rationale:

Blacklisting or otherwise getting rid of unwanted changesets is a
non-negotiable requirement for legal reasons.

We want splicing so we can, at some point in the future and if we so
desire, pull in the CSRG version history and connect it up with our
own.

Stashing local changes is necessary if you can't have uncommitted
local changes while merging, and it really doesn't make sense to allow
that. People do gripe, but the best way is to stash your changes,
merge, and unstash them. Otherwise if you get a merge conflict in a
file you've also got local changes to, it becomes an awful mess.
(Note that in comparable situations CVS makes you check out a whole
new tree...)

Checking out subtrees is widely desired for working on single programs
or (in particular) checking out only the kernel.

Reverting portions of the tree locally is often necessary for one
reason or another in practice; lack of adequate support for this in
most of the "modern" version control systems has been and remains a
barrier to adoption in/for NetBSD.

We still need to be able to run ident on binaries and get useful
information out. Keyword expansion is not the only way to accomplish
this; but it's easier to deploy and use than any of the alternatives.
A reasonable implementation should not suffer from the persistent
aggravations that CVS keywords often cause. (All expansions need to be
invertible; all actions, particularly diffs and merges, should always
be done using the unexpanded form.)
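
A minimal sketch of what "invertible" means here, in Python. The
keyword syntax, the Version keyword, and the expand/unexpand function
names are assumptions made up for this example, and it assumes the
stored form is always the unexpanded one and that values contain no
dollar signs or newlines.

```python
import re

# Hypothetical syntax: $Name$ when unexpanded, $Name: value $ when expanded.
KEYWORD = re.compile(r"\$([A-Za-z]+)(?::[^$\n]*)?\$")


def expand(text, values):
    """Fill in known keywords; leave anything unrecognized untouched."""
    def repl(m):
        name = m.group(1)
        if name not in values:
            return m.group(0)
        return "$%s: %s $" % (name, values[name])
    return KEYWORD.sub(repl, text)


def unexpand(text, names):
    """Collapse known keywords back to their stored, unexpanded form."""
    def repl(m):
        name = m.group(1)
        if name not in names:
            return m.group(0)
        return "$%s$" % name
    return KEYWORD.sub(repl, text)


stored = 'static const char version[] = "$Version$";\n'
working = expand(stored, {"Version": "1234"})
print(working, end="")
print(unexpand(working, {"Version"}) == stored)
```

Diffs and merges would run on the output of unexpand, so the expanded
values never show up as spurious changes or conflicts.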
@


1.2
log
@typo.
@
text
@d181 1
a181 1
are expected to reconnect to their parent (b.g. feature development
d200 2
a201 1
metadata by hand.
@


1.1
log
@Stuff distilled from my notes and previous arguments and bikeshed sessions
@
text
@d47 1
a47 1
feature and a de facto property of distributed version contorl.
@