[WIP] RFC: Index Build 4.x #11018

Gerold103 · 2025-01-08T19:00:42Z

Gerold103
Jan 8, 2025
Collaborator

Reviewers

Tickets

Non-blocking index build blocks synchro queue #10766

Changelog

v2: Chose "lazy index" as the primary solution.
v3: Renamed build=<mode> to wait=<bool>. Wrapped the rejected alternatives into drop-downs, to make the document smaller. Added a new open question about wait name.
v4: Force-async txn solution alternative is rejected.
v5: Large rewrite based on the discussion on a meeting. Index build is always async and automatic now. Besides, the document tries to address the concurrency bugs of the current index build strategy.
v6: More detailed description of how the build works, regarding iteration of the PK and concurrent transactions. Also new questions for discussion about certain breaking change proposals.
v7: Release will be 4.x, nontrivial alter is deprecated because it is a problem even for users, the replica-local state is in a new replica-local space.
v8: _index_local state is dropped in case of successful build, temporary indexes are explicitly mentioned.
v9: The index build algorithm is made more engine-agnostic but works via space::on_replace() trigger now.
v10: More details about alter ban and how to handle it.

Summary

There are 3 large problems right now. The document tries to address them all in a single solution.

Problem 1 - replication

When a space is big enough, building a new index on it can be quite long. Minutes, hours, depending on the space size. The same is about index alter - it might require index rebuild, space fullscan. That isn't a big deal locally on the instance, because the build is asynchronous - transactions can still be processed, even on the changing space.

But it gets complicated in a cluster due to the following reasons.

Replication gets stuck in a replicated cluster. Yes, the index build is async fiber-wise, but it blocks the current fiber. The blockage happens on-replace into _index space. Not on-commit. Because of that the applier's feature of committing the txns asynchronously doesn't help. The longest part happens before the commit.

The replica's lag will grow, it won't receive any new data until the build is finished. But the replication still is alive, and at least it doesn't block the transaction processing on the master when the replication is asynchronous. Unlike the next problem.

Master transaction processing gets stuck in a synchronously replicated cluster. Because the index build transaction on the master blocks the limbo until the appliers also apply it and write to their WALs. And that will last until the quorum of replicas have finished the index build.

Essentially, in a synchro cluster with large spaces it becomes impossible to create new indexes. It requires hacks, like creating a new space with all the needed indexes and same format, then slowly copy the data from the old space, in multiple small transactions, then delete the old space. Sounds not complex really, but it requires the user to change their code to maintain this "migration" process by writing into both old and new spaces while the copying is in progress.

Problem 2 - concurrent rollbacks

The idea with the yielding index build (it yields every few thousands of tuples inserted to let other fibers work) was that it would allow to build large indexes without stalling the entire TX thread.

But it adds concurrency problems. Like what happens if the building index took a tuple, which was rolled back later? Some of those cases are already handled in the code. Normally such tuples are also deleted from the index. However in a few situations this won't happen: #10929, #10930, #10931.

Those all seem to be united by a shared fundamental issue: the building index is invisible to transactions. They implicitly add tuples into it (via space:on_replace trigger), but they can't delete tuples from it (because there is no such trigger as space:on_rollback or something, and the :on_replace isn't called again for rolled back statements, which would be wrong to do anyway).

This brings all sorts of nasty bugs in, like when a txn was started during index build, its commit was started, then index build is committed, and then the txn gets rolled back due to WAL write error or because the synchronous replication leader was changed and cancelled this txn. To normal indexes such events have access via struct space::index array, and can clean their crap out. The building indexes OTOH are invisible and sometimes are keeping the trash tuples in.

Problem 3 - nontrivial index alter is unpredictable for users

Imagine an index had 2 key parts. The user's code was using index:get/delete/update based on these 2 parts. Types don't even matter.

Now imagine the user wants to make the index have 3 parts. The user would do index:alter({parts = new_3_parts}).

In both async and sync replication this transaction is going to become visible on the master earlier than on the replicas. It means, the nodes in the replicaset are having different definitions of the same index for a while. Code working on one node, like index:get{part1, part2}, won't work on another node. And then at a random point in time would stop working on the first node too.

The user can't switch their code to the new key parts at once or in any other controllable way besides adding checks right in the code for how many parts does an index have, and acting depending on that.

This is very unhandy. To the point that doesn't seem like anybody is using such backward incompatible alterations in prod.

OTOH, the alter of the sorts integer -> number type change or unique -> non-unique are fine. They get executed immediately because don't require an index rebuild or re-check, and the old code works with the new definitions too (besides index:get() when uniqueness is lost, but it can be replaced with :select() before the alter).

✅ Solution: non-blocking background index build

Lets see what can be done on the high level.

Consider the first problem - a long index build blocks replication because the transaction can't be committed until the index is built. The solution is right here - allow to commit the transaction before the index is complete. Build the index lazily, in "background" (not blocking neither the current fiber nor the transaction), after the commit. The index entry is added to _index instantly and the txn gets committed as fast as if the space was empty.

Coincidentally, this also contains the key to the solution of the second problem - the index becomes visible in _index and struct space::index array instantly, before its build is started. It means, that if the index would start building after commit into _index, it would automatically 1) apply all new transactional statements to the already inserted keys, 2) apply rollbacks for the already inserted keys from all the rolled back transactional statements. Because all of that is just DML, and DML always iterates over the entire struct space::index, both on transaction statement creation and on rollback. See memtx_space_replace_all_keys() for Memtx, and vy_replace() and the other similar ones for Vinyl.

The presence of alter directly affects how complicated this idea is going to be in details. If alter exists and an index alter/create is instantly visible, then each index would have 2 states: actual and target (build is in-progress), and they would have to be both maintained in at least the system spaces and in the internal objects like struct index. Besides, no matter how to design this, the original issue of nontrivial alter being unusable in a replicaset won't go away. So the proposal is to forbid nontrivial index alter starting from 4.x version. This simplifies the rest of the proposal and won't allow the users to shoot their own legs off.

Since the non-trivial alter support is being deleted entirely, it means the users can't upgrade to 4.x if their xlogs contains such alters. Thus the users potentially having alters in xlogs must stop doing more of such operations, do a snapshot, and only then upgrade to 4.x. In order for that not to be a surprise Tarantool 3.x will start printing a warning when a non-trivial alter happens, that such operation won't be possible in 4.x.

The same problems existing for space alteration (it might require to validate all data which makes such transaction long and blocks the replication) are ignored for now. It is not considered as important as index creation issues.

Regarding trivial alter - it includes not only type extensions and uniqueness drop, but also engine-specific settings which don't require rebuild or validation of the entire index. For Vinyl that would be settings like page size. For memtx - hints in the tree index. The hints are going to be allowed to be changed, but the change will only take effect after snapshot + restart. Because hints on/off requires a rebuild, which won't be possible in 4.x on the fly.

The remaining part of the document tries to elaborate this new way of building the indexes in greater details.

Build state

Image that an index is being built after a commit into _index. This raises the following questions:

What happens if index build fails some time later?
What happens if index build succeeds? How to save that and to show to user?

Assume the replication doesn't exist yet. It is a good idea to solve those issues as conveniently as possible in scope of one instance, and then try to scale the best solution across the replicaset.

Lets introduce some sort of index build state. It should provide the following:

Describe the current progress at least in a way "finished" or "in progress".
Show error message if the build has failed.
Be persisted. So on restart a failed index wouldn't start building again. And a completed index could be waited on, before box.cfg() completes.
Support trivial alteration (parts type expansion, uniqueness drop, index rename).
User can see the index build state somewhere.
Existing storage-side code and connectors should keep working at least as long as they don't touch _index directly. It would also be nice not to break _index too much.

The proposal is to create a new replica-local space _index_local. It has the following format: [space_id, index_id, state].

The state field describes the index's local state when it doesn't exactly match the row in _index. Currently it would only contain the following:

_index.state = {
    build = 'in progress' / 'fail',
    error = <build_error_if_any>,
}

It is extendible, since this is a map. And can be indexed using JSON indexes if needed in the future. For example, it might be handy in the future to support nontrivial alter again.

When an index is created, it is instantly committed into _index space as it is now and into _index_local space with in progress build state. The goal of the instance is to build the index so it reaches fail or success state eventually. If fails, the state.build automatically becomes fail and the state.error is filled with an error object encoded in MessagePack. In case of success the row is just deleted from _index_local.

The _index space format and all its fields meaning remain as is. The only difference is behavioral - commit happens instantly, but the index itself isn't usable until fully built later.

An example: imagine there is a space with an ID 512. A new index is created for it:

_index:  {512, index_id, 'test', 'tree', {unique = true}, {{2, 'integer'}}}
_index_local: {512, index_id, build = 'in progress'}

After a while it gets built and the tuple is dropped from _index_local:

_index:  {512, index_id, 'test', 'tree', {unique = true}, {{2, 'integer'}}}

Now imagine the parts are changed to {{2, 'number'}}. This is how it gets reflected in the tuples:

_index:  {512, index_id, 'test', 'tree', {unique = true}, {{2, 'number'}}}

Note, how the _index is just changed right away, and _index_local wasn't affected. Because a rebuild wasn't needed and becomes not possible anyway. So the local and global states of the index match, and the local one is then just dropped.

DDL validation and automatic alteration

Imagine the above is done. This means, that insertion and deletion operations on _index space would have to also touch _index_local space. Including potentially existing old code from users who could be touching _index space directly.

Users must not do that, yes (touch system spaces directly), but it is easy to think of a usecase - a backup tool. If one exists out there, it might be scanning each space in Tarantool, including _index, not even looking which of them are system or not, saving the tuples somewhere, and then inserting them again on a new instance to restore from the backup.

If the new build way would require to manually update _index_local, it would break such simple but useful apps.

Luckily, we can make _index_local update automatically, right from _index:on_replace() trigger. It is not hard to do.

Given such drastic changes in the index build way, it makes sense to list all the new DDL validations and hooks:

Renames, key part type expansions, and uniqueness removal are just executed right away, like before. They don't require an index rebuild and are backward compatible in the user code.
Any index alter requiring an index rebuild now is forbidden. For example, number -> integer is no longer allowed.
When a new tuple is added to _index, the trigger must make a replace into _index_local to create its build state which is going to be in progress.
When a tuple is dropped from _index, the trigger must also delete it from _index_local.
When a tuple is being inserted into _index_local, it is only allowed if all of the following is true:
- There is a corresponding entry in _index.
- The state.build is in progress.
- The state.error is nil.
When a tuple is being dropped from _index_local, it is only allowed if there is no more entry for it in _index or the index is successfully built.
An update of an _index_local tuple can only be done for the following cases:
- state.build from in progress to fail, and the index must be actually failed. Again, will be visible in struct index.
- state.error from nil to an error object, and the build must be fail then.

After the nontrivial alter deprecation and removal, the checks above will be even simpler than the current state of alter.cc.

All those changes for _index_local space and validations of everything are automatic and happen in _index:on_replace() triggers.

It means, if there is any code doing _index:update/delete/replace/..., it would keep working. With nuances, but simple things will remain functional.

Access

Right now if an index is in struct space::index array, then it is usable. And its vtab points at Vinyl's LSM or Memtx hash or tree functions. Vtab would only depend on the index type.

Assuming an index is added to struct space::index instantly, even before it is complete, this can't be left like that. Otherwise incomplete indexes would be readable, which is not acceptable. And yet the DML transactions and their rollbacks must be allowed to happen on such triggers, to a limited extent (such as ignore updates happening on not yet indexed tuples or rollback changes done on already indexed tuples).

To support that there is going to be a new type-agnostic index vtab. It replaces struct index::vtab while the build is in progress. It saves the original type-specific vtab somewhere, and proxies to it all the updates that are allowed to be done. All the DQL (get, select, pairs) are going to be rejected at the level of this new vtab.

When the build is complete, the proxy-vtab is dropped, and the index becomes fully usable, if build was successful. Otherwise if the build has failed, the vtab rejects all DQL with an error.

The DML is at the engine-level, so about that there is a section further in the document.

Replication

Luckily, the approach above will just work in a replicaset without any needed changes on top. It wasn't so simple in the previous iterations of the document, but the currently proposed way is already replication-friendly.

Indeed, assume a master added a new tuple to _index space. It gets committed immediately, and then the build works in the background. The master-local _index_local space is updated via _index:on_replace().

The _index tuple gets replicated and reaches the replicas. They will also execute their own _index:on_replace() triggers and update their _index_local space. From this moment it all works same as on the master.

Lua and other APIs

The whole idea of going through all those vast changes was that the Lua API of the index creation and alteration won't change. Not even just the methods of the index Lua object, but ideally even its "low-level" API - the _index space operations. How it was working before should keep working in the new versions, at least the simple cases, and yet not have the problems described in the beginning of the doc.

It means that indexes created as space:create_index(...) would internally start using the new way of index creation. :create_index() method would now launch index creation by committing it to _index space, and then block the current fiber, to wait until the index is built successfully. From user's PoV it would look same as before, on the success path.

To support non-blocking index build there is going to be a new option: space:create_index(name, {wait = true/false, other_opts...}) - wait. It is a flag which defines what create_index() should do - wait for the index to get created and built (wait = true, default), or start its build and leave (wait = false).

Ifwait = true, and the index build fails, then space:create_index() will also drop the index. So it doesn't remain in a failed state, clogging _index space, hanging there forever. If wait = false, then the failed index remains hanging in the spaces _index and _index_local until it or the entire space are dropped.

The option only regulates how space:create_index() works. It is not propagated to _index space and isn't stored anywhere.

The changes, which would become visible to the users in any non-trivial setup are the following:

❗️ If space:create_index() fails before reaching the end of the waiting, or the instance gets restarted, then the failed index will stay in _index space. It might even succeed in building, if the instance was restarted, and after restart the index has finished the build successfully. In other words, space:create_index() is not transactional anymore.
❗️ Any code making _index:insert() would get executed immediately, but won't actually produce a fully working index.
❗️ space:create_index() on success does not guarantee anymore, that with the synchronous replication the index is created on a quorum of replicas. Previously it was ensured, but now the index creation success only means local success with eventual consistency on the replicas.
❗️ index:alter() is no longer possible for any non-trivial changes.
❗️ Index creation might succeed on a subset of nodes in the replicaset, and the replication will keep working. Previously a failed index build would cut the replica off from the master.

The absolutely non-problematic API changes will happen in Lua index object - it will get a new field state which will show the build status and an error if the build has failed.

Also the index will get a new method :wait(timeout). It will block the current fiber until the index build is finished. If this was an error, then the method will throw the error. Otherwise return true. On timeout the function returns false.

Internals

Here are some implementation details that are worth mentioning before the implementation starts.

Temporary spaces

Some spaces are fully temporary. I.e not only they are replica-local, but also aren't even stored in WAL. They are visible in _space, but absent in the xlogs and aren't saved into the snapshots.

Their indexes are visible in _index, but are also temporary. I.e., not persisted.

The new space _index_local must take that into account. All its updates related to temporary indexes must also skip WAL and snapshots.

Worker fibers

It is suggested to create one fiber for each background index build, and store it directly in struct index (maybe not directly, but in some new member of struct index which describes the entire build state). This would be easier to control in the code, when the entire index state is in one place, easily visible directly in each index object. The fiber would be forcefully cancelled and joined when the index gets dropped. And the state would delete itself automatically, if the build completes to the end (success or fail, doesn't matter). The fiber belongs to the system - it is visible but not cancellable or joinable by the users.

Having one fiber per index also would simplify storing build context and making _index_local transactions. If the fiber would be one for all indexes, it would have to store some sort of a context for each index anyway, attach/detach txns/read-views/whatever, and this would look clumsy.

In the end of work the fiber will try to update the _index_local tuple with the actual state or drop it. If the txn fails due to any reason (WAL, conflict), then the fiber will retry every few seconds.

It is still an open question though. In reality it might happen, that having fiber per index would be harder to control, to shut them all down at once for example, when the instance is being shut down. This might change during the implementation.

Generic build algorithm

The current code of index build is engine-specific. See memtx_space_build_index() and vinyl_space_build_index(). They are mostly identical really, by the main logic. Just a few tweaks are done in each of them to support the Memtx and Vinyl transaction managers, isolation, and various recovery stages.

The recovery-stages tweaks seem to be replaceable by enum box_recovery_state, which could unify the build process, and the code would fallback on recovery to some new methods like engine->recover_index() or something.

The transactional tweaks seem to be coming mainly from the necessity to build the index in a "read view", in a single long transaction. They try to protect the new index from rollbacks of existing transactions by doing wal_sync(), by aborting concurrent txns on the same space, etc, but it doesn't really work as seen in the tickets mentioned in the beginning of the document.

The worst part is that it can't even possibly work okay when synchronous replication is enabled. Because for that the old way of index building has to wait until all the limbo transactions are confirmed, to be sure their tuples won't be rolled back. However while the new index is waiting for the limbo confirmation, the limbo might get new transactions! This is a problem of Achilles who can never catch up with the turtle. It could be resolved by freezing the limbo entirely while the index is waiting, or by trying to commit the index enabling via the limbo too. But that all doesn't sound very reliable and requires the index build process to be aware where are the transactions spending time before the commit (WAL, limbo, what is next?). Besides, a limbo freeze would affect the master's behaviour for regular unrelated non-DDL txns.

Instead, the new way of index build suggests to think of it as a "functional index where the function is true for all keys < current build key", and the function's key gets updated gradually, moving forward. It is logically identical to creation of a new space having the new index as a primary key and copying all tuples to there from the old index, while having :on_replace() on the old space to catch updates and add repeat them for the new space in the same transaction.

Here is the proposed algorithm:

Add the new index to struct space::index array on commit into _index. It becomes visible everywhere, but not available for DQL (at the index_vtab level).
The new index is created always non-unique. So the concurrent DML operations wouldn't break on it when they insert a duplicate.
If the space is actually unique, then set an on_replace trigger for that space. The trigger must do the following:
- For each txn statement check if it creates a duplicate. If it does, then create an on_commit trigger for that statement.
- The on_commit trigger must abort the new index build.
Abort all existing transactions for that space.
Sync WAL and the limbo to make sure that all the being committed txns also finish committing (because they can't be easily intercepted by the new index already).
Set the primary build key to nil (= the minimal possible key).
Read the whole primary index in a loop, using READ_CONFIRMED isolation. On each iteration:
- Update the primary build key to the current tuple.
- Insert each primary tuple into the new index. If there is a duplicate, then check the duplicate's PK.
  - If the duplicate's PK is != the current primary build key, then it is an actual duplicate - abort.
  - If the duplicate's PK is = the current primary build key, then this tuple was added by a DML operation, which are happening concurrently and this tuple was inserted into the new index ahead of the current build key.
- Yield every N tuples.
- When reached the end of the primary index, break out of the loop.
Wait for all current transactions on that space to become committed or rolled back. (Or just all current transactions, if that is simpler.)
Make the index unique if it was actually created unique. So the new DML operations would start failing on trying to insert a duplicate.
Make the index visible and regular (i.e. remove its build-time DML logic about the filtering by the build key).

Pros: engine-agnostic besides perhaps some assertions which might have to be added in the engines' code.
Cons: uses space::on_replace trigger which will make user's DML on the Vinyl space always do reads.

A couple of details about the above written procedure:

Why make the index non-unique at first? It allows not to care about the concurrent DML and just let it work. It will never fail on a duplicate key and the new index won't affect the concurrent transactions.
Why not make the duplicate check right in the txn_stmt::on_commit trigger? Why do that right inside the transaction? Because in Vinyl it might require reading, which can yield, and on_commit triggers can't yield.
Why getting a duplicate with PK < the current build key is abort? Because it means this build procedure has already inserted this tuple. Even if it was later updated by a concurrent DML.
Why meeting a duplicate with PK > the current build key isn't ok and still is considered a legit duplicate? Image this test:

- Step 1: insert first tuple.
pri_idx: {1, 4}, {2, 3}, {3, 2}, {4, 1}
                 ^pk
new_idx: {1, 4}

- Step 2: insert second tuple.
pri_idx: {1, 4}, {2, 3}, {3, 2}, {4, 1}
                         ^pk
new_idx: {2, 3}, {1, 4}

- Step 3: insert a new tuple {5, 2}
pri_idx: {1, 4}, {2, 3}, {3, 2}, {4, 1}, {5, 2}
                         ^pk
new_idx: {2, 3}, {1, 4}, {5, 2}

- Step 4: try to insert the next tuple into the new index
pri_idx: {1, 4}, {2, 3}, {3, 2}, {4, 1}, {5, 2}
                                 ^pk
new_idx: {2, 3}, {1, 4}, {3, 2}, {5, 2}
                         \____________/
                           duplicates

We then get duplicates where the new tuple is {3, 2} and the old one is {5, 2}. The old's PK is greater, but it is still a bad duplicate with with the build must be aborted.

However this won't make the code entirely engine-agnostic anyway. To support this way of building the engine must provide a way to "insert" into an index, not into the whole space. Most likely it can be done by implementing the index_vtab::replace for Vinyl or adding a new method specifically for building. Luckily, it won't have to support anything but just insertion. No deletes, updates, or anything alike.

Upgrade and schema

The script upgrade.lua becomes problematic, because to make an initial snapshot on 4.x it would have to create the initial schema from 1.7.1 and then execute all the updates of all the versions until 4.x, including alterations. If so, then the file would have to be squashed to only support the initial schema of the latest 3.x as the first schema of 4.x.

Besides, there is a dummy schema created in schema_init() right now. It creates some system spaces with default settings (like that all indexes are trees) which mismatch what is later applied from the snapshot, which triggers an alter. So schema_init() will have to be rewritten to create all the spaces in their latest versions.

👌 Answered questions

What to do with _vindex? Should it work same as now? Returning the build state from it wouldn't change the old connectors who won't read that state anyway. So as long as an index is returned from _vindex, no matter its state, the old connectors will keep seeing this index as active. I propose not to touch _vindex at all then.

How would a hash index work? Firstly, it only exists in Memtx. Secondly, if the primary index is not hash, then build of a secondary hash index works same as a tree index. But if the hash index is the primary one, then it is problematic - the build code can't yield. Because in a hash index there is no order. Iteration of the primary index has to be done with no yields, because otherwise during a yield some insertion might happen, and there is no way to see if it happened on already processed tuples or on not yet processed ones. The primary index build key makes no sense, basically.

It works the same way in the current way of index building. And would remain like this. Just that this yield-less build would now happen after commit into _index, not before.

Alternatives

♻️ Solution: alter support and better backward compatibility

The v6 version of the document proposed a solution which would keep nontrivial alter support, and wouldn't require any new spaces.

The key idea was that an index can have 2 states: actual and target. The actual is the replica-local state, the one being usable for DQL. The target is the global state set by the master. The replicas are trying to reach it by building the target version and eventually replacing the actual one with the target one.

The replica-local state was stored in a new field in _index space. The field was storing the replica-local state and was itself also replica-local. I.e. the statements touching it would be having GROUP_LOCAL replication group.

The solution wasn't accepted, because was overly complicated in its attempts to keep the nontrivial alter alive, which apparently isn't even needed.

♻️ Solution: lazy index with auto-complete

The currently chosen solution with "lazy indexes" requires the user not only to create the lazy index, but also to manually make it non-lazy when the build is complete.

Why not make the index automatically update the build mode, when it gets ready? Just make an internal transaction, which changes the opts.build to 'now' (or wait = true, or whatever else is the naming in the primary solution atm).

There is a few of reasons, why not:

If the master builds the index much earlier than the replicas (for example, a replica could restart in the middle of the building, and had to start building the index from zero), and it updates the build-mode, then this DDL transaction would be stuck in the limbo until a quorum of replicas builds the index too. And it returns us back to the original problem. This means the build mode alter txn must not even try to commit until the index is finished on a quorum of instances.
It is not always possible. If the cluster becomes read-only due to any reason, the mode update won't work. It would have to wait for the instance to become writable, which makes the indexing logic slip into the scope which shouldn't be of its concern. Spaces and indexes and their locally stored data in general shouldn't care about instance status relative to other nodes.
It can potentially lead to a mess in master-master (the original ticket is about synchro, but as it was shown in the intro, this also concerns async replication). AFAIU, in such setups users must be very careful not to do txns on random nodes on the same data, or it can lead to races (like the easiest one, make different replaces on the same key on 2 nodes - they would "exchange" the values, not sync to one value). Assuming a user managed to do that successfully until now, we would break this setup. Because Tarantool would start doing internal txns on the same data (same _index entry on multiple nodes). That doesn't look safe.

♻️ Solution: nothing

The issue in the ticket isn't really a bug. It is an inconvenience, which has a workaround explained right in the intro - copy the space manually, write to both spaces, then replace the old one.

The only problem is that the user would have to support that in their code.

Lets repeat the solution here for clarity. When a user wants a new index or alter an existing one in a non-trivial way, or to change a space format, they do this:

Create a new space with all the same metadata and indexes as the original one.
User creates the needed new index on the new space (or alters the one that needs a change) or alters the format.
User duplicates all the replaces and deletes of the old space to the new space. This can be done, for example, by setting an on_replace trigger on the old space, which does the same work on the new space.
User iterates over the old space and copies the tuples into the new space, with yields every N tuples.
When the work is done, the user in a single transaction drops the old space, and renames the new space to the old one's name.

Pros: don't need to do anything, already works.

Cons:

Memory waste on the duplicate data until the old space is dropped.
Inconvenient, easy to make a mistake in the migration process.

♻️ Solution: replica-local index

The problem of index creation/alter is hitting the replication hard. One approach could be to attack the replication shortcomings then. That is, drop the replication from the process.

Lets imagine that the replicas and master could build the same indexes independently, fully local. And when finished, the master would in a single small DDL transaction "enable" this index.

The index creation would then be a 2 step process. 1 - create a local index on all replicas. 2 - turn the local index into a global one on the master.

This needs 2 features which aren't available yet, but aren't hard to add:

Replica-local index.
Local->global index promotion.

Replica-local DDL is not unusual for Tarantool. There is right now a space type temporary (not to confuse with data-temporary). It can be created on read-only replicas, can have its own indexes, is visible in _space and its indexes in _index, but it is not replicated, and its data isn't stored in WAL.

Replica-local persistent data also is not a new thing. Tarantool does have "local" spaces. They have replicaset-global meta (_space and _index rows) and their data is persisted, but not replicated. They can only be created by master, but can take DML on any instance and it is not replicated.

The proposal is to introduce replica-local indexes. They can be created by any replica, even read-only, on absolutely any space. This index is persisted in _index and is not replicated.

Creation of the index will not affect replication at all, and won't block the limbo, because replica-local transactions are not synchronous by definition.

To create a new global index, the user then would then go and create a replica-local index on each instance.

Then to make it global the user would on the master instance make index:alter{is_global = true}. Locally it works instantly. When this txn comes to the replicas, it will try to find a replica-local index in this space with all the same meta besides index ID. If found, it also works instantly, by changing the index ID to the global one (ID is primary, so it would mean moving local index's data to the new global index with the global ID, and dropping the empty local index). If not found, a new index is created as usual.

The solution not only allows to create/alter indexes in the cluster bypassing the replication, but also allows the user to purposefully create replica-local indexes without ever making them global. It could be handy to reduce memory usage on the master and speed-up master's DML. Master would only store the unique indexes and handle DML, and the replicas would store the other indexes + serve DQL.

The cons is that the user has to visit each replica to create the replica-local indexes on the first step.

Pros: introduces a new feature - replica-local indexes, which can be used not only for replicaset-wide index building.

Cons: needs 2 steps, one of them to be done on each instance in the replicaset. Including new instances, where this index won't appear automatically.

♻️ Solution: space clone

This one is discarded, because it tries to re-invent `space:upgrade()` which is already available in EE.

The proposal attacks the issue from another angle - if a space alteration is long, then users would typically, in another DB, make a second space with all the same meta + the alterations, and then copy the data from the old to the new one + repeat changes to the already copied keys. We could do the same, but automate it.

That is, Tarantool would allow to clone a space with any of its indexes and metadata altered. Once the cloning is done, the user could do the final "drop + rename" themselves.

If designed carefully, this could be an interesting tool to do more than just a new index creation, like:

Add more than one index.
Rebuild an index (change its parts in some not forward-inclusive way).
Alter space format.

In Lua code it could look like this:

-- User creates a new space with any indexes and format.
new_space = create_new_space()

-- User starts non-blocking copying using space:clone() method.
-- It works in transactions max 1000 rows each.
--
-- The new space is attached to the old one, and all the changes to the
-- already copied data are also repeated, so nothing is lost.
--
-- The new space during this process is read-only. User can't alter it,
-- can't write to it.
old_space:clone(new_space, {batch_size = 1000})

-- When the cloning is complete, the old space will only keep replicating
-- the DML. Background copying is done.
--
-- Now the user can replace the old one with the new one:
new_space:replace(old_space)
-- This operation renames the new space and changes its ID to
-- match the old space. The old one is dropped. All happens in one
-- transaction. ID has to be preserved so the remote ID-based connectors
-- would continue writing to the right space.

More interesting outcomes:

Copying could be done for spaces of different engines (as long as we introduce cross-engine transactions).
Memtx space cloning might be relatively cheap if the spaces would have the same format. The new space would just ref the tuples of the old one.

Pros:

Solves the problem of any space-related DDL.
Replication is untouched (depends on how to implement, see Cons).
Not much action from user needed.

Cons:

Potential memory waste on the duplicate data until the old space is dropped.
User needs to make 2 steps.
Need to design what happens if the master is changed. Is this whole process happening in-memory? Should it be somehow continued on the replicas if master dies? But then to which extent? Replicas can't write safely even if are writable, because it would break master-master.

♻️ Solution: force async

Why not make index creation and space format alter `TXN_FORCE_ASYNC` txns? I.e. they wouldn't block the limbo. Right before their completion on the master we could lock the limbo, wait until all its txns are confirmed, and then commit this one.

Seemingly it could be safe - If a master at the time of this DDL commit had all the data of this space replicated and confirmed, it means all the replicas must be able to apply this format/index too.

There is waiting time until the limbo is locked and gets emptied, but it only depends on the replication and replicas' WALs speed. Not on the space size.

It wouldn't work right away, because the appliers on the replicas would still be executing those txns for a very long time, but this can be solved by executing them in their own fiber, not in the applier's main fiber. Perhaps this would bring other complications, and yet the solution looked worthy of consideration.

However it was rejected, because firstly, replicas might have different data because of before_replace triggers. Secondly, this introduces a point of potential split-brain. Any async txn in election=auto can cause a split-brain even with a proper quorum.

Pros

Doesn't seem too hard to implement.
Solves the problem of any space-related DDL.
No need for user action at all. It just works.

Cons

There might be hidden consistency complications. For example, what happens, if master commits this txn basically async, and then stops being the leader? Then another leader would reject that txn and cause a split-brain. Can this be mitigated? No, it can't.

⚒️ Build: with engine-specific parts, intercepting every DML, catch-up via `txn_stmt::on_commit()` triggers.

❗️ Below is unchanged initial algorithm of the index build proposed in some older versions of the RFC.

Pros: doesn't use space::on_replace() trigger so the Vinyl space won't start doing reads on every DML just to fetch the old tuple for the trigger.
Cons:

Quite a lot of engine-specific code.
Needs to insert concurrently committed tuples into the new index in txn_stmt::on_commit() triggers, which won't be possible for Vinyl due to yields on reads (any on-commit code can't yield). This part has to be reworked if the solution is brought back due to any reason. For example, actually do the DML before the commit. If get a duplicate, then abort the index build in txn_stmt::on_commit() trigger but don't abort the current transaction (it would affect the user's code).

Further down is the algorithm.

Add the new index to struct space::index array on commit into _index. It becomes visible everywhere, but not available for DQL (at the index_vtab level).
Set build key to null. I.e. the minimal primary index key.
All new transaction statements trying to do DML on that index get filtered (by the temporary index_vtab or some other way):
- Ignore the txn statement if it is > build key.
- Apply the txn statement if it is <= build key.
Abort all existing transactions for that space.
Sync WAL and the limbo to make sure that all the being committed txns also finish committing (because they can't be easily intercepted by the new index already).
Read the whole primary index in a loop, using READ_CONFIRMED isolation. On each iteration:
- Update the build-key. (Can be build-tuple, to just reference it and not extract the key into new memory.)
- Insert each primary tuple into the new index.
- Yield every N tuples.
- When reached the end of the primary index, make the build key = "infinity" and break out of the loop.
Wait for all current transactions on that space to become committed or rolled back. (Or just all current transactions, if that is simpler.)
Make the index visible and regular (i.e. remove its build-time DML logic about the filtering by the build key).

While sounding simple, it actually contains a lot of hidden icebergs in the engine waters. See the known ones below.

What happens if a txn statement greater than the build key was ignored, but then the index build process reached this key before the txn was committed?

The following build-key-related mix situations are possible:

A transaction changed a key > build-key. Then the index build went through that key (build-key became >= this key). Then the txn was committed.
A transaction changed a key > build-key. Then the index build went through that key. Then the txn was rolled back.
A transaction changed a key > build-key. And was committed or rolled back before the index build went through that key.

The situations 1 is bad, because the build process took the old tuple, since it was reading with read-confirmed isolation. And the new one was ignored. Now the index is outdated in this key.

The situation 2 is ok, because the new tuple wasn't seen, and will never be seen by anybody anyway, because it was rolled back.

The situation 3 is ok as well. The index, when reaches this key, will just take whatever was the last confirmed version.

Lets inspect situation 1 closer then. When blindly skipping statements > build key, the problem is that on commit we lose the new version of the key's tuple. Neither Memtx nor Vinyl notify all indexes of a space when something gets committed into it. In Vinyl the situation is even worse than in memtx, because if a statement was skipped for a certain vy_lsm, and it has no txv object, then on commit it won't be put into the LSM-tree's memory-level. It only happens for created txvs.

The only solution on the surface seem to skip the statements which are > build key, but install for each of them an on-commit trigger, right on the specific txn_stmt objects. The trigger would check if the statement at the moment of commit is already <= build key. And if so, it would be applied to the new index too.

Logically that should resolve the key-mix problem. But implementation-wise it creates new problems. See about that in the next point.

Vinyl doesn't use index vtab for DML and has deferred deletes.

In case of Memtx the on-commit trigger mentioned in the previous section would just provide the old and the new tuple, and they can be simply applied on the new index. Delete old, insert new.

In case of Vinyl it is never simple.

First issue is that when any DML happens on Vinyl spaces, they go through vy_delete(), vy_update(), vy_replace(), vy_upsert(). From the moment of entering those functions the code doesn't use index vtab. It is a problem, because now for some LSM-trees of a space (the in-progress ones) it becomes necessary to filter-out their DML. To apply changes <= build key, and set on-commit trigger for the others. This filtering being done at the index vtab level would just do nothing in Vinyl. Because Vinyl doesn't use struct index_vtab::replace.

In some form that also has to stay like this, because in Vinyl it is quite expensive to read the old tuple for each DML for things like updates. It is important that it is read once from PK (if read at all), and then used for updating all the LSM-trees of that space without reading them (if they are not unique).

"Delete" statement in Vinyl doesn't even read the old tuple ever, regardless of uniqueness. It remains unknown. Replace can also skip the old tuple lookup entirely (if no indexes are unique besides the primary one).

That means, not only Vinyl can't just define struct index_vtab::replace and be done with it, but also its statement's on-commit trigger sometimes just won't have old_tuple specified for deletes or replaces. It would only have request->key for delete and stmt->new_tuple for replace and insert. The only safe case is the update, but that doesn't matter.

That in turn means, that the index build has to have engine-specific code. So Vinyl's build-index-on-stmt-commit trigger could read one set of fields in txn_stmt and in Memtx it would just rely on old_tuple and new_tuple. Although it would be probably simpler than the index build code is right now anyway. For Memtx the trigger will be delete old + insert new. For Vinyl the trigger has to observe the txn_stmt->row to find out the statement type and re-apply it the same way as the other LSM-trees did. I.e. make a deferred delete, blind replace, etc.

The exact way of implementation, even if being turned down during the code review, could be redone relatively easy, if the main logic above is approved. Nonetheless, I could mention here some ideas to validate before the coding starts:

vy_delete() deferred deletes can actually be skipped by the new index just fine. Such deletes affect all indexes implicitly and would even affect the newly building index too.
Any place that does for (uint32_t i = 0; i < space->index_count; i++) must check if the index is being built. Such index would have a generic build context object of some sort + Vinyl-specific context. If such an index is met, then the DML is skipped (txv is not created for it), and it sets an on-commit trigger for the current txn_stmt.
When the trigger fires, it repeats the txn_stmt->row statement.
The statement is applied right away. If possible, without struct txn transactions at all. Directly to the memory-level of the LSM-tree. If not possible, then a temporary vy_tx transaction is created just for this one statement, and immediately committed. Without any WAL entries, of course. If just vy_tx is not enough, then an actual struct txn transaction can be fired.

And since this work of "statement catch ups" is pushed down to the engine-level, then Memtx would also have to define such thing for its own indexes. Although it must be much simpler than for Vinyl.

The index build process would then activate this "catch up" and then start the build.

nshy · 2025-04-11T14:07:26Z

nshy
Apr 11, 2025
Collaborator

Another issue with memcs is memory duplication. In case of strings for example if we use developed generic build then new index will have duplicates values for string columns because we get pk rows as tuples and then parse these tuples back to insert into new index. I guess we need to work out some interface to be able to pass rows without serialization/deserialization to tuples when iterating and inserting, so that we can use generic index build in memcs.

Currently memcs string are duplicated in indexes but we are going to fix this (https://github.com/tarantool/tarantool-ee/issues/1102).

This issue is similar to mentioned in https://github.com/orgs/tarantool/discussions/11018#discussioncomment-12642403. Yet wanted to write it down.

0 replies

Astronomax · 2025-07-13T18:32:18Z

Astronomax
Jul 13, 2025
Collaborator

This approach suffers from one annoying issue. The issue lies in how this RFC proposes to track duplicates:

If the space is actually unique, then set an on_replace trigger for that space. The trigger must do the following:
- For each txn statement check if it creates a duplicate. If it does, then create an on_commit trigger for that statement.
- The on_commit trigger must abort the new index build.

This approach cannot prevent the occurrence of duplicates when memtx/vinyl MVCC is enabled.

Imagine that the on_replace trigger is called for a tuple tuple. The proposal suggests checking which tuple currently exists in the building index for the corresponding key. One could directly look into the index to see the tuple currently in the tree for that key. However, with MVCC, this would be meaningless because it would only show the last "in progress" tuple for that key. What we really need is to see the last replace statement for that key (and the same for the primary key) that will occur before our tuple is inserted in the final serial sequence. Yet, this cannot be checked just once in the on_replace trigger because such a tuple is not yet determined. The on_replace trigger is called for an "in progress" transaction, and by the time this transaction reaches the "prepared" state (calling box_commit), several other "prepared" transactions may have replaced that key (or the primary one).

In MVCC, duplicate checking is also not performed in a single step. For the inserted tuple, a separate story is created and placed in the corresponding chains for each index. Then, for each "in progress" story story, an invariant is maintained: if this story were to transition to the "prepared" state right now and take its place in the serial sequence, no duplicates would arise. This invariant must be checked every time some story breaker becomes "prepared". At the moment a story breaker becomes "prepared", all "in progress" transactions for which the invariant is violated (victim) are rolled back with an ER_CONFLICTED error. Importantly, of course, not all "in progress" stories are checked for this — only those that could potentially violate the invariant (i.e., those in the same chain as the "prepared" breaker story in one of the indexes).

Thus, this approach, at the very least, requires the implementation of more complex mechanisms for the building index, similar to those used in MVCC.

3 replies

Astronomax Jul 14, 2025
Collaborator

To be more precise, the invariant maintained by MVCC is as follows:

If an inserted tuple x conflicts with a tuple y in some secondary index
(i.e., they have the same key in that index), then x must also conflict with y in the primary key.

For non-unique indexes, the secondary key format internally combines with the primary key format, which is why non-unique indexes allow duplicates. However, for unique indexes, the key format is not merged with anything else. It’s easy to see that in this case, the maintained invariant ensures the absence of duplicates for unique indexes.

However, for the index being built right now, a similar invariant would be difficult to maintain, because it simply does not yet have all the committed tuples. But all prepared and in-progress contains according to RFC proposal:

Abort all existing transactions for that space.

Sync WAL and the limbo to make sure that all the being committed txns also finish committing (because they can't be easily intercepted by the new index already).

Fortunately, we don't need to guarantee that duplicates won't occur during index construction. And we don't need to maintain this invariant. We do not want to roll back the preparing transaction when the invariant for the key being built is violated. All we want to guarantee is that there will be no duplicates in the unique secondary index being built at the time we declare the build successful. So we want to abort the index building, and we do not care that this may not happen immediately, but only when the pointer (called "primary build key" in RFC) iterating over the primary key reaches the corresponding committed tuple.

locker Jul 14, 2025
Maintainer

Imagine that the on_replace trigger is called for a tuple tuple. The proposal suggests checking which tuple currently exists in the building index for the corresponding key. One could directly look into the index to see the tuple currently in the tree for that key. However, with MVCC, this would be meaningless because it would only show the last "in progress" tuple for that key. What we really need is to see the last replace statement for that key (and the same for the primary key) that will occur before our tuple is inserted in the final serial sequence. Yet, this cannot be checked just once in the on_replace trigger because such a tuple is not yet determined. The on_replace trigger is called for an "in progress" transaction, and by the time this transaction reaches the "prepared" state (calling box_commit), several other "prepared" transactions may have replaced that key (or the primary one).

Can't we read the committed/confirmed (depending on the transaction isolation level) tuple and add it to the transaction read set (as if it were a normal read) so that if the tuple is replaced the transaction is rolled back with a conflict?

Astronomax Jul 14, 2025
Collaborator

Yes, it seems we could indeed do it this way—when inserting a tuple x, perform an isolated read of tuple y by its primary key, verify that x replaces y in the currently building secondary indexes, and then track the read for each of these building secondary indexes. Note that we need to track reads specifically for the building index, meaning we’ll have to maintain its own chains in some form. And if we maintain chains for the building index, we must consider what to do with committed tuples in the primary index that haven’t yet been added to our chain (i.e., the iterator hasn’t reached them yet).

However, this approach has a significant drawback—it would lead to a high number of false-positive aborts. This is precisely why MVCC doesn’t handle duplicates this way—it doesn’t track reads for every secondary index. Tuple y might be replaced by some other tuple in the serial sequence, yet inserting x afterward could still avoid creating duplicates. The proposed approach would, in this case, wrongly abort index building, which is a long-running operation, and the likelihood of such a scenario occurring could be non-negligible.

This comment has been hidden.

Sign in to view

This comment has been hidden.

Sign in to view

This comment has been hidden.

Sign in to view

This comment has been hidden.

Sign in to view