/**
* SCATTERED CACHE DESIGN
*
* Idea
*
 * Distributed caches have fixed owners for each key. Operations where the originator is one of the owners
 * require fewer messages (and possibly fewer roundtrips), so let's make the originator always the owner.
 * In case of a node crash, we can retrieve the data by inspecting all nodes.
 * To find the last-written entry in case of a crashed primary owner, entries keep write versions
 * in their metadata. These versions also determine the order of writes; we don't have to use locking anymore.
*
* Pros:
*
* - faster writes
*
 * - no locking contention
*
*
* Cons:
*
 * - reads always have to go to the primary owner (slower reads in small clusters)
 *
 * - complex reconciliation during rebalance
*
*
* Scope of implementation
*
 * - Scattered cache is implemented to be resilient to a single node failure (equivalent to a distributed
 * cache with 2 owners).
 *
 * - Transactional mode is not implemented yet.
 *
 * - Functional commands are not fully implemented yet.
 *
 * - All other features (state transfer, distributed streams, persistence...) should work.
*
*
* Operations
*
 * We need to keep tombstones with versions after entry removal. These tombstones have a limited lifespan -
 * we keep them around only until the invalidations are applied on all nodes.
 *
 * The versions have to grow monotonically; therefore the version counter is not per-entry but per-segment
 * (as tombstones are eventually removed, a per-entry version would be lost). The version is implemented by
 * {@link org.infinispan.container.versioning.SimpleClusteredVersion} and therefore contains the topology id.
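 *
 * The per-segment counter can be pictured with the following sketch (hypothetical and simplified - the real
 * bookkeeping lives in {@link org.infinispan.scattered.ScatteredVersionManager}; all names below are illustrative):
 *
 * <pre>{@code
 * import java.util.concurrent.atomic.AtomicLongArray;
 *
 * // Simplified per-segment monotonic version source. In Infinispan the value is
 * // wrapped together with the topology id in SimpleClusteredVersion.
 * final class SegmentVersionCounter {
 *    private final AtomicLongArray versions;   // one counter per segment
 *
 *    SegmentVersionCounter(int numSegments) {
 *       this.versions = new AtomicLongArray(numSegments);
 *    }
 *
 *    // Monotonically increasing version for the given segment; it is not tied to any
 *    // particular entry, so removing an entry (and later its tombstone) does not lose
 *    // ordering information.
 *    long nextVersion(int segment) {
 *       return versions.incrementAndGet(segment);
 *    }
 * }
 * }</pre>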
*
 * Unlike in other cache modes, entry commit does not happen in {@link org.infinispan.interceptors.impl.EntryWrappingInterceptor}
 * but before replication to the backup in {@link org.infinispan.interceptors.distribution.ScatteredDistributionInterceptor}
 * (see below for the detailed operation descriptions). As scattered cache synchronizes only on the data container
 * (instead of using locking interceptors), the value in the data container can change between loading it in
 * {@link org.infinispan.interceptors.impl.EntryWrappingInterceptor} and committing it. Therefore, for commands
 * that read previous values according to {@link org.infinispan.commands.write.WriteCommand#loadType()},
 * the version seen before modification is checked against the actual data-container value, and if it does not match,
 * a {@link org.infinispan.interceptors.distribution.ConcurrentChangeException} is thrown. This is caught in
 * {@link org.infinispan.interceptors.impl.RetryingEntryWrappingInterceptor} and the command is retried in that case.
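 *
 * A minimal, self-contained sketch of that commit-time check, using a plain {@code ConcurrentHashMap} as a
 * stand-in for the data container (the exception and retry wiring are simplified, all names are illustrative):
 *
 * <pre>{@code
 * import java.util.concurrent.ConcurrentHashMap;
 *
 * final class VersionedCommitSketch {
 *    record VersionedValue(Object value, long version) {}
 *
 *    private final ConcurrentHashMap<Object, VersionedValue> container = new ConcurrentHashMap<>();
 *
 *    // seenVersion is the version observed when the command loaded the entry in
 *    // EntryWrappingInterceptor; newVersion was assigned by the primary owner.
 *    void commit(Object key, Object newValue, long seenVersion, long newVersion) {
 *       container.compute(key, (k, current) -> {
 *          long currentVersion = current == null ? 0 : current.version();
 *          if (currentVersion != seenVersion) {
 *             // A concurrent write was committed in the meantime; in Infinispan this
 *             // surfaces as ConcurrentChangeException and the command is retried.
 *             throw new IllegalStateException("concurrent change, please retry");
 *          }
 *          return new VersionedValue(newValue, newVersion);
 *       });
 *    }
 * }
 * }</pre>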
*
*
 * Single entry write (put, getAndPut, putIfAbsent, replace...)
*
 * originator == primary owner (flow sketched below)
 *
 * - Primary increments version for segment
 *
 * - Primary commits entry
 *
 * - Primary picks one node (next member in CH) and sends backup RPC
 *
 * - Backup commits entry
 *
 * - Backup sends RPC response
 *
 * - Primary returns after it gets the response
 *
 * - Primary schedules invalidation of the entry with lower versions
*
*
 * The selection of the backup could be random, but keeping it ~fixed probably reduces overall memory consumption.
 *
 * Updating the value on the primary before the backup finishes does not change data consistency - even in
 * a distributed cache, if the backup RPC fails we can't know whether the backup has committed the entry,
 * so the value can be published anyway.
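 *
 * A hypothetical sketch of the primary-originator flow above; the {@code Backup} interface and the
 * {@code commitLocally()}/{@code scheduleInvalidation()} stubs are illustrative stand-ins, not Infinispan APIs:
 *
 * <pre>{@code
 * import java.util.concurrent.CompletionStage;
 *
 * final class PrimaryWriteSketch {
 *    interface Backup {
 *       CompletionStage<Void> replicate(Object key, Object value, long version);
 *    }
 *
 *    private final SegmentVersionCounter versions;   // per-segment counter, see the sketch above
 *    private final Backup nextMember;                 // ~fixed backup: the next member in the CH
 *
 *    PrimaryWriteSketch(SegmentVersionCounter versions, Backup nextMember) {
 *       this.versions = versions;
 *       this.nextMember = nextMember;
 *    }
 *
 *    CompletionStage<Object> write(Object key, Object value, int segment) {
 *       long version = versions.nextVersion(segment);           // 1. increment the segment version
 *       Object previous = commitLocally(key, value, version);   // 2. commit on the primary first
 *       return nextMember.replicate(key, value, version)        // 3.-5. backup commits and responds
 *             .thenApply(ignored -> {
 *                scheduleInvalidation(key, version);            // 6.-7. after the response, return and
 *                return previous;                               //       invalidate older copies elsewhere
 *             });
 *    }
 *
 *    private Object commitLocally(Object key, Object value, long version) {
 *       return null;   // stand-in for the data container update
 *    }
 *
 *    private void scheduleInvalidation(Object key, long version) {
 *       // stand-in for the batched invalidation described below
 *    }
 * }
 * }</pre>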
*
*
 * originator != primary owner (flow sketched below)
 *
 * - Origin sends a sync RPC to the primary owner
 *
 * - Primary increments version for segment
 *
 * - Primary commits entry
 *
 * - Primary returns response with version (+ return value if appropriate)
 *
 * - Origin commits entry
 *
 * - Origin schedules invalidation of the entry with lower versions
*
*
 * Invalidation must be scheduled by the origin, because the primary does not know whether the backup has committed.
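 *
 * A hypothetical sketch of the non-primary-originator flow above (again, all names are illustrative stand-ins):
 *
 * <pre>{@code
 * import java.util.concurrent.CompletionStage;
 *
 * final class RemoteWriteSketch {
 *    record VersionedResponse(long version, Object previousValue) {}
 *
 *    interface Primary {
 *       // The primary increments the segment version, commits, and answers with the
 *       // assigned version (plus the previous value when the command needs it).
 *       CompletionStage<VersionedResponse> write(Object key, Object value);
 *    }
 *
 *    private final Primary primary;
 *
 *    RemoteWriteSketch(Primary primary) {
 *       this.primary = primary;
 *    }
 *
 *    CompletionStage<Object> write(Object key, Object value) {
 *       return primary.write(key, value).thenApply(response -> {
 *          commitLocally(key, value, response.version());    // origin commits with the primary's version
 *          scheduleInvalidation(key, response.version());    // the origin, not the primary, schedules invalidation
 *          return response.previousValue();
 *       });
 *    }
 *
 *    private void commitLocally(Object key, Object value, long version) {
 *       // stand-in for the data container update
 *    }
 *
 *    private void scheduleInvalidation(Object key, long version) {
 *       // stand-in for the batched invalidation described below
 *    }
 * }
 * }</pre>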
*
*
 * Single entry read
*
 * originator == primary owner
 * Just a local read
 *
 * originator != primary owner
 * At the moment, we just invoke a sync RPC to the primary owner
*
* Possible improvement (not implemented yet)
*
 * - Origin locally loads the entry with SKIP_CACHE_LOAD
 *
 * - Origin sends a sync RPC including the version to the primary
 *
 * - Primary compares the version with its own
 *
 * - If the version matches, origin gets just a successful response and returns the locally-loaded value
 *
 * - If the version does not match, the value + version is sent back
*
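 * A hypothetical sketch of this version-check read (not implemented; all names are illustrative):
 *
 * <pre>{@code
 * import java.util.concurrent.CompletionStage;
 *
 * final class VersionCheckReadSketch {
 *    record VersionedValue(Object value, long version) {}
 *
 *    interface Primary {
 *       // Returns null when the supplied version still matches the primary's copy,
 *       // otherwise the current value + version.
 *       CompletionStage<VersionedValue> readIfNewer(Object key, long localVersion);
 *    }
 *
 *    private final Primary primary;
 *
 *    VersionCheckReadSketch(Primary primary) {
 *       this.primary = primary;
 *    }
 *
 *    CompletionStage<Object> read(Object key, VersionedValue localCopy) {
 *       long localVersion = localCopy == null ? -1 : localCopy.version();
 *       return primary.readIfNewer(key, localVersion).thenApply(remote -> remote == null
 *             ? (localCopy == null ? null : localCopy.value())   // version matched: no value on the wire
 *             : remote.value());                                 // stale or missing locally: use the primary's copy
 *    }
 * }
 * }</pre>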
*
*
* Optional configuration options:
*
 * - Allow reading local values only (if present) - risk of stale reads
 *
 * - Store the read value locally with expiration (L1 enabled) - as invalidations are broadcast anyway,
 * there's not much overhead with that. This will still require an RPC on read (unless stale reads
 * are allowed) but not marshalling the value.
*
*
* Multiple entries writes
*
 * - Increment versions for primary-owned entries and commit them
 *
 * - Back these entries up to the next node
 *
 * - Send all other entries to their primary owners
 *
 * - Commit those entries after a successful response from the primary
*
*
* Possible improvement (not implemented yet):
*
 * Entries for which this node is the primary owner would not be backed up just to the next member,
 * but to a node that is the primary owner of other entries in the multi-write. That way some messages
 * can be spared by merging the primary(keyA) -> backup and origin -> primary(keyB) requests.
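 *
 * A hypothetical sketch of splitting a multi-key write by primary owner (owners are reduced to plain strings,
 * all names are illustrative):
 *
 * <pre>{@code
 * import java.util.HashMap;
 * import java.util.Map;
 * import java.util.function.Function;
 *
 * final class MultiWriteSketch {
 *    // Group the entries so that each primary owner receives a single RPC with all the
 *    // keys it owns; keys owned by this node are committed and backed up locally.
 *    static Map<String, Map<Object, Object>> groupByPrimary(Map<Object, Object> entries,
 *                                                           Function<Object, String> primaryOwnerOf) {
 *       Map<String, Map<Object, Object>> perOwner = new HashMap<>();
 *       entries.forEach((key, value) ->
 *             perOwner.computeIfAbsent(primaryOwnerOf.apply(key), owner -> new HashMap<>())
 *                     .put(key, value));
 *       return perOwner;   // one RPC per target node instead of one per key
 *    }
 * }
 * }</pre>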
*
*
 * Multiple entries reads
*
* Same as single entry reads, just merge RPCs for the same primary owners.
*
* Invalidations
*
 * It would be inefficient to send invalidations (key + version) one-by-one, so these are merged
 * and sent in batches, using {@link org.infinispan.commands.write.InvalidateVersionsCommand}.
*
* Possible improvement (not implemented yet):
*
 * The list of invalidations-to-be-sent could be updated when we get an invalidation from another node, in order
 * to reduce the overall noise.
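 *
 * A minimal sketch of such batching; the flushing and the RPC that would build the actual
 * {@link org.infinispan.commands.write.InvalidateVersionsCommand} are omitted, and all names are illustrative:
 *
 * <pre>{@code
 * import java.util.HashMap;
 * import java.util.Map;
 *
 * final class InvalidationBatch {
 *    private final Map<Object, Long> pending = new HashMap<>();
 *
 *    // Keep only the highest version per key; an invalidation for an older version
 *    // is subsumed by a newer one.
 *    synchronized void schedule(Object key, long version) {
 *       pending.merge(key, version, Math::max);
 *    }
 *
 *    // Called periodically, or when the batch grows large enough; the drained map
 *    // becomes the payload of a single invalidation command.
 *    synchronized Map<Object, Long> drain() {
 *       Map<Object, Long> batch = new HashMap<>(pending);
 *       pending.clear();
 *       return batch;
 *    }
 * }
 * }</pre>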
*
*
 * State Transfer
*
 * During rebalance, scattered cache always uses the pending CH for both reads and writes. It does not implement
 * four-phase rebalance; as the segment state and the ability to read/write on a node are tracked in
 * {@link org.infinispan.scattered.ScatteredVersionManager}, only two-phase rebalance is used.
*
 * When a command traverses the interceptor stack, {@link org.infinispan.interceptors.impl.PrefetchInterceptor}
 * checks the segment state and either retrieves the remote value (ahead of the regular state transfer) or blocks
 * the command until the state transfer is finished (for commands which need all values - there's no need to start
 * a second retrieval of all values).
*
* The state transfer of a segment has several phases:
*
 * - NOT_OWNED: this node is not the primary owner, but it can back up the entry
 *
 * - BLOCKED: the node has just become an owner but the old owner has not revoked segment ownership yet
 *
 * - KEYS_TRANSFER: the node knows the highest version for the given segment and is requesting
 * keys + versions (no values) from all other nodes
 *
 * - VALUES_TRANSFER: we got all keys with metadata and now store the highest version of each key
 * and the node storing the value in {@link org.infinispan.container.entries.RemoteMetadata}
 *
 * - OWNED: all data is in
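 *
 * The per-segment life cycle roughly corresponds to the following enum (a simplified rendering of the states
 * tracked by {@link org.infinispan.scattered.ScatteredVersionManager}):
 *
 * <pre>{@code
 * enum SegmentState {
 *    NOT_OWNED,         // not the primary owner, but may hold a backup copy
 *    BLOCKED,           // new owner, the old owner has not revoked ownership yet
 *    KEYS_TRANSFER,     // highest version known; fetching keys + versions from all nodes
 *    VALUES_TRANSFER,   // keys known; fetching values from the nodes with the highest versions
 *    OWNED              // all data is in
 * }
 * }</pre>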
*
* There are new commands:
*
 * - {@link org.infinispan.commands.statetransfer.ScatteredStateConfirmRevokedCommand} that makes sure that all
 * old owners have adopted the new topology and won't serve further requests according to the old one.
 *
 * - {@link org.infinispan.commands.statetransfer.ScatteredStateGetKeysCommand} that is very similar to
 * {@link org.infinispan.commands.statetransfer.StateTransferStartCommand} but moves only keys.
*
*
 * After a node crash, we go through 3 topologies:
 *
 * - CH_UPDATE just removing the dead member (STABLE topology)
 *
 * - REBALANCE that starts shuffling entries around (TRANSITORY topology)
 *
 * - CH_UPDATE with the final (STABLE) topology
*
*
 * Operations are always driven by the new primary owner of the given segment.
*
*
 * If the segment has not changed its primary owner:
 * {@link org.infinispan.scattered.ScatteredStateProvider} does:
 *
 * - Replicate all data from this segment to the next node, using
 * {@link org.infinispan.statetransfer.OutboundTransferTask#pushTransfer} set to {@code true}
 *
 * - Send {@link org.infinispan.commands.write.InvalidateVersionsCommand}s with all keys in this segment to all
 * nodes but the next member (which receives the state via the push transfer)
*
 * Writes to an entry can proceed in parallel with this process; an invalidation cannot overwrite a newer entry.
 * However, an invalidation from an executed write can arrive at the new backup before the state transfer - in that
 * case the cluster would have 3 copies of that entry until the next write, as the entry would not be invalidated
 * on the backup.
*
* If the segment just got a new primary owner:
* {@link org.infinispan.scattered.impl.ScatteredStateConsumerImpl} does:
*
 * - Synchronously retrieve the highest version for this segment from all nodes (using {@code GET_MAX_VERSIONS})
 *
 * - Request all nodes to send all keys + versions from this segment (and do that locally as well)
 *
 * - Retrieve values from the nodes with the highest versions
 *
 * - Send invalidations to all other nodes, removing the entry there
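 *
 * A hypothetical sketch of that reconciliation sequence; the {@code Rpc} interface and all names are
 * illustrative stand-ins for the actual commands:
 *
 * <pre>{@code
 * import java.util.Map;
 * import java.util.concurrent.CompletionStage;
 *
 * final class SegmentReconciliationSketch {
 *    interface Rpc {
 *       CompletionStage<Long> maxVersion(int segment);                      // GET_MAX_VERSIONS
 *       CompletionStage<Map<Object, Long>> keysAndVersions(int segment);    // keys + versions, no values
 *       CompletionStage<Map<Object, Object>> valuesFor(Map<Object, Long> highestVersions);
 *       CompletionStage<Void> invalidateOtherCopies(Map<Object, Long> keptVersions);
 *    }
 *
 *    private final Rpc rpc;
 *
 *    SegmentReconciliationSketch(Rpc rpc) {
 *       this.rpc = rpc;
 *    }
 *
 *    CompletionStage<Void> reconcile(int segment) {
 *       return rpc.maxVersion(segment)                                      // 1. highest version for the segment
 *             .thenCompose(max -> rpc.keysAndVersions(segment))             // 2. collect (key, version) from all nodes
 *             .thenCompose(highest -> rpc.valuesFor(highest)                // 3. values from the highest-version nodes
 *                   .thenCompose(values -> rpc.invalidateOtherCopies(highest)));   // 4. drop the extra copies
 *    }
 * }
 * }</pre>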
*
*
* Clean rebalance (after join, no data is lost)
*
 * An optimization for the rebalance when there's a single owner with all the data (the previous primary) has not
 * been implemented yet.
*
* Partition handling
 * A partition becomes degraded any time it loses more than one node compared to the members of the last stable
 * topology. In degraded mode, all operations are prohibited; even a partition that contains all owners of a key
 * cannot operate (in that case operations are allowed in distributed caches), because we don't know who the backup
 * owner is. Having the primary owner is not sufficient either; the other partition may still be available and
 * therefore we could provide inconsistent or possibly stale data.
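 *
 * A minimal sketch of that availability rule, with member lists reduced to plain collections (names are
 * illustrative):
 *
 * <pre>{@code
 * import java.util.Collection;
 * import java.util.HashSet;
 * import java.util.Set;
 *
 * final class AvailabilitySketch {
 *    // With only a single backup copy, losing more than one node may lose both copies
 *    // of some entry, so the partition must refuse all operations.
 *    static boolean isDegraded(Collection<String> lastStableMembers, Collection<String> currentMembers) {
 *       Set<String> lost = new HashSet<>(lastStableMembers);
 *       lost.removeAll(currentMembers);
 *       return lost.size() > 1;
 *    }
 * }
 * }</pre>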
*
* Persistence
 * As we don't hold locks for the part of the stack after {@link org.infinispan.interceptors.impl.EntryWrappingInterceptor},
 * we need another synchronization for storing an entry into a cache store. We don't want to block the data container
 * for the potentially long cache store update, and therefore {@link org.infinispan.interceptors.impl.ScatteredCacheWriterInterceptor}
 * goes into the data container (acquiring the lock) just to compare versions and create a {@link java.util.concurrent.CompletableFuture}
 * that serves as a lock which can be waited upon in a non-blocking way.
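 *
 * A hypothetical sketch of that future-as-lock idea - it orders store writes per key without holding the
 * data-container lock for the store update itself; names are illustrative, not the actual interceptor code:
 *
 * <pre>{@code
 * import java.util.concurrent.CompletableFuture;
 * import java.util.concurrent.CompletionStage;
 * import java.util.concurrent.ConcurrentHashMap;
 *
 * final class StoreOrderingSketch {
 *    private final ConcurrentHashMap<Object, CompletableFuture<Void>> locks = new ConcurrentHashMap<>();
 *
 *    // Chain the store update after the previous one for the same key; only installing
 *    // the future needs the data-container synchronization, not the (potentially slow)
 *    // store write itself.
 *    CompletionStage<Void> storeOrdered(Object key, Runnable storeWrite) {
 *       CompletableFuture<Void> next = new CompletableFuture<>();
 *       CompletableFuture<Void> previous = locks.put(key, next);
 *       CompletionStage<Void> start = previous == null ? CompletableFuture.<Void>completedFuture(null) : previous;
 *       return start.thenRun(storeWrite)
 *             .whenComplete((ignored, throwable) -> {
 *                next.complete(null);        // release the "lock" for the next writer
 *                locks.remove(key, next);    // clean up if we are still the last writer
 *             });
 *    }
 * }
 * }</pre>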
*
*
*
 * Potential problems
*
* Listeners
*
 * - The pre- listeners may be invoked multiple times, with stale values (the command then does
 * not update the DC, and retries).
 *
 * - However, if the command itself does not read the value, it can commit even if the value changed
 * in the meantime, and the listener will get an out-of-date value.
 *
 * - As the ordering of updates to the DC is based on versions, it is possible that an operation arrives at the DC
 * and finds that a newer (according to the version) update has already been applied there. In that case, the
 * operation correctly finishes, but an event for this update is not fired, as we don't have the previous value;
 * the event that was fired for the newer update carries the value from before this update.
*
*/
package org.infinispan.scattered;