Merge algorithm

Three-way merge

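We implement a three-way merge. A merge involves three sheets: a local sheet (L), a remote sheet (R), and a common ancestor (P) from which both were derived. The sheets themselves are written SL, SR, and SP.

(One can also think of L, R, and P as Left, Right, and Parent.)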
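
To make the idea concrete, here is a minimal sketch of the textbook three-way rule applied to a single cell. It is an illustration only, not necessarily the rule our implementation uses; the function name merge_cell and the (value, conflict) return convention are assumptions made for this example.

```python
from typing import Optional, Tuple


def merge_cell(parent: str, local: str, remote: str) -> Tuple[Optional[str], bool]:
    """Return (merged_value, conflict) for one cell (hypothetical helper)."""
    if local == remote:
        # Both sides agree: unchanged, or changed in the same way.
        return local, False
    if local == parent:
        # Only the remote side changed.
        return remote, False
    if remote == parent:
        # Only the local side changed.
        return local, False
    # Both sides changed, and differently: a human has to decide.
    return None, True
```

A value is only reported as a conflict when both sides moved away from the parent in different directions; every other case can be settled automatically.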

Assumptions

Sheets should be in "portrait" format, with data of the same type grouped in columns, and data referring to the same entity grouped in rows.

Sheets may have a row or two of header information at the top. This isn't required, but it is anticipated that the "top" of a sheet may have different content than its "body".

Empty cells are treated as "unspecified" for merging purposes.

Comparison with text merge

Merging sheets differs from standard three-way text merge in a couple of ways.

A reordering of lines is no longer a drastic change. For many sheets, rows may be treated as elements in an unordered set.
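
As a rough illustration of order-insensitive comparison, the sketch below diffs two lists of rows while ignoring where each row appears. It uses a multiset (collections.Counter) rather than a strict set so that duplicated rows are still counted; that choice, and the name row_diff, are assumptions of this example.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Row = Sequence[str]


def row_diff(old: List[Row], new: List[Row]) -> Tuple[list, list]:
    """Return (added, removed) rows, ignoring row order (hypothetical helper)."""
    old_counts = Counter(tuple(r) for r in old)
    new_counts = Counter(tuple(r) for r in new)
    # Counter subtraction keeps only the positive counts.
    added = list((new_counts - old_counts).elements())
    removed = list((old_counts - new_counts).elements())
    return added, removed
```

With this comparison, a sheet whose rows were merely shuffled reports no additions and no removals, whereas a line-based text diff would report many.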

There is a uniform structure to every line in sheets, and this structure may get systematically permuted.
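
One simple way to cope with a systematic permutation of columns is to match columns by header name. The sketch below assumes the sheet has a header row with unique names, which is not guaranteed (headers are optional, as noted above); the names column_mapping and reorder_row are made up for this example.

```python
from typing import List


def column_mapping(old_header: List[str], new_header: List[str]) -> List[int]:
    """For each column of old_header, give its index in new_header, or -1 if absent."""
    position = {name: i for i, name in enumerate(new_header)}
    return [position.get(name, -1) for name in old_header]


def reorder_row(row: List[str], mapping: List[int]) -> List[str]:
    """Rearrange a row from the new layout into the old column order.

    Columns missing from the new layout come back as empty cells.
    """
    return [row[j] if j >= 0 else "" for j in mapping]
```

Once every row is brought back into a common column order, the cells can be compared position by position.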

Representing conflicts

Some merges will need human intervention. It is useful if a conflicting merge can itself be represented as a sheet. The most convenient way to do this is to insert extra conflict-control rows and columns. The version control tool should detect unresolved conflicts and prevent them from being checked in.

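Imagine a three-way merge where all sheets SL, SR, SP are unrelated. A "dumb" conflict sheet can be produced by giving each sheet its own block of columns and its own block of rows, then adding a first row that labels every column with the sheet it came from and a first column that does the same for every row. The top-left cell holds a special tag T0 that marks the sheet as a conflict sheet.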

This conflict sheet has the benefit of being reversible, apart from some issues with the treatment of blank values. The purpose of merging is to reduce redundancy in this sheet, ideally to the point where the special tag T0 doesn't need to be used at all.

An example of a full conflict sheet (with T0 set to [conflict]):

LOCAL sheet
-----------
name,  age, location
Paul,   99, Space
Noemi, -10, Imagination

REMOTE sheet
------------
item, cost, qty
frog,   10,  1
bell,    5, 10

CONFLICT sheet
--------------
[conflict], local, local, local,      remote, remote, remote
local,      name,  age,   location
local,      Paul,   99,   Space
local,      Noemi, -10,   Imagination
remote,                               item,    cost,   qty
remote,                               frog,      10,    1
remote,                               bell,       5,   10
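
The layout above can be reproduced with a short sketch. The function names dumb_conflict_sheet and split_conflict_sheet, the list-of-lists sheet representation, and the padding with empty strings are all assumptions made for illustration; this is not the actual implementation.

```python
from typing import Dict, List, Tuple

Sheet = List[List[str]]


def dumb_conflict_sheet(sheets: List[Tuple[str, Sheet]], tag: str = "[conflict]") -> Sheet:
    """Stack labelled sheets into one conflict sheet without merging anything."""
    widths = [max((len(r) for r in rows), default=0) for _, rows in sheets]
    total = sum(widths)
    # First row: the tag, then one source label per column.
    header = [tag]
    for (label, _), width in zip(sheets, widths):
        header += [label] * width
    out = [header]
    offset = 0
    for (label, rows), width in zip(sheets, widths):
        for row in rows:
            cells = [""] * total               # blank padding outside this sheet's block
            cells[offset:offset + len(row)] = row
            out.append([label] + cells)        # first column: source label per row
        offset += width
    return out


def split_conflict_sheet(conflict: Sheet) -> Dict[str, Sheet]:
    """Recover the original sheets by selecting matching row and column labels."""
    header = conflict[0]
    result: Dict[str, Sheet] = {}
    for row in conflict[1:]:
        label = row[0]
        cols = [i for i, h in enumerate(header) if i > 0 and h == label]
        result.setdefault(label, []).append([row[i] for i in cols])
    return result
```

Splitting gives back the inputs exactly when every row is as wide as its sheet. A row that was shorter comes back padded with blanks, because the padding cannot be told apart from genuinely blank cells. This is one concrete instance of the blank-value issue mentioned above.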


Now, in less extreme cases, we should be able to find columns and rows that we don't need to duplicate, since they either exist in both sheets (we will label these "share") or can be easily merged (we will label these "merge"). And if everything is shared or merged, we don't need to include the conflict row and column at all.
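
Since a fully merged sheet carries no conflict row or column, a tool could check for leftovers before allowing a check-in. The sketch below assumes the tag sits in the top-left cell, as in the example above; the function name and the default tag value are hypothetical.

```python
from typing import List


def has_unresolved_conflict(sheet: List[List[str]], tag: str = "[conflict]") -> bool:
    """True if the sheet still starts with the conflict tag (assumed position: top-left)."""
    return bool(sheet) and bool(sheet[0]) and sheet[0][0].strip() == tag
```

A pre-commit hook, for instance, could refuse any sheet for which this returns True.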