This package contains internal GemStone classes for implementing
caching on top of the GemFire Distributed System.
Local Regions
LocalRegion implements the basic caching mechanism and allows for
subclasses to perform message distribution and other specialization
of LocalRegion functionality. A LocalRegion is an implementation of the
java.util.Map interface that supports expiration, callbacks, server cache
communication, and so on.
LocalRegion has a RegionMap that
holds the actual data for the region.
Most changes to an entry in a LocalRegion are performed in three steps, as sketched below:
* The entry is modified under synchronization using an EntryEventImpl object. The event
is also queued for later callback invocation under this synchronization.
* Distribution is allowed to occur outside of synchronization.
* Synchronization is again obtained on the entry and the callbacks are invoked.
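A minimal sketch of that three-step pattern follows; the Entry and Event types and the distribute() helper here are simplified placeholders, not the actual internal classes:

class EntryUpdateSketch {
  static final class Entry { Object value; }
  static final class Event { Object newValue; Runnable callback; }

  void put(Entry entry, Event event) {
    // Step 1: modify the entry and queue the callback under entry synchronization.
    synchronized (entry) {
      entry.value = event.newValue;
      event.callback = () -> System.out.println("afterUpdate fired");
    }
    // Step 2: distribution to other members happens outside the synchronization.
    distribute(event);
    // Step 3: re-acquire synchronization on the entry and invoke the queued callback.
    synchronized (entry) {
      event.callback.run();
    }
  }

  private void distribute(Event event) {
    // Placeholder for message distribution (e.g. a DistributedCacheOperation subclass).
  }
}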
A LocalRegion may also have a DiskRegion
associated with it for persistence or overflow to disk.
Distributed Regions
DistributedRegion is a subclass of LocalRegion that interacts with
locking and the DistributedSystem to implement distributed caching.
Most DistributedRegion operations are carried out using subclasses of
DistributedCacheOperation.
Partitioned Regions
The contents of a partitioned region are spread evenly across
multiple members of a distributed system. From the user's standpoint,
each member hosts a partition of the region and data is moved from
partition to partition in order to provide scalability and high
availability. The actual implementation of partitioned regions
divides each partition into sub-partitions named "buckets". A bucket
may be moved from one partition to another partition in a process called
"migration" when GemFire determines that the partitioned region's data
is not spread evenly across all members. When a bucket reaches a
maximum size, it is split in two and may be migrated to a different
partition.
Data is split among buckets using the Extensible Hashing algorithm,
which hashes data based upon the lower-order bits (the "mask") of the
data's value (the Region entry's key, in the case of GemFire). All
partitions of a given region share a directory that maintains a
mapping between a mask and information about the bucket that holds
the data that applies to that mask.
When an entry is placed into a partitioned region, the bucket
directory is consulted to determine which member(s) of the distributed
system should be updated. The Extensible Hashing algorithm is useful
when a bucket fills up with data and needs to be split. Other hashing
algorithms require a complete rebalancing of the partitioned region
when a bucket is full. Extensible Hashing, however, only requires
that the full bucket be split in two, thus allowing the other
buckets to be accessed without delay. The diagram below demonstrates
bucket splitting with extensible hashing.
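As an illustration of the lookup side of this scheme, the following sketch shows how a global depth and mask could select a directory slot from the lower-order bits of a key's hash; the class and field names are hypothetical, not GemFire's internal implementation:

class BucketDirectorySketch {
  int globalDepth = 2;              // the directory has 2^globalDepth slots
  int[] directory = {0, 1, 2, 3};   // slot -> bucket id

  int bucketFor(Object key) {
    int mask = (1 << globalDepth) - 1;   // depth 2 -> binary mask 11
    int slot = key.hashCode() & mask;    // lower-order bits of the key's hash
    return directory[slot];
  }
}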
A BucketInfo contains metadata about a bucket (most importantly the
locations of all copies of the bucket) that is distributed to members
that access the partitioned region. Changes to the BucketDirectory
metadata are coordinated through GemFire's distributed lock service.
Inside of a region partition are a number of Buckets that hold the
values for keys that match the bucket's mask, as shown in the diagram
below.
The total size (in bytes) of a bucket is maintained as key/value
pairs are added. It is not necessary for the bucket to store the
value of a region entry as an actual object, so the bucket stores the
value in its serialized byte form. This takes up less space in the
VM's heap and allows its size to be calculated accurately. The
entry's key, however, is used when looking up data in the bucket and
must be deserialized. As an estimate, the size of the key object is
assumed to be the size of the object's serialized bytes. When an
entry's value is replaced via an update operation, the size of the old
value is subtracted from the total size before the size of the new
value is added in. It is assumed that the key does not change size.
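A minimal sketch of this bookkeeping, assuming hypothetical names (the real bucket implementation differs):

import java.util.HashMap;
import java.util.Map;

class BucketSizeSketch {
  private final Map<Object, byte[]> entries = new HashMap<>();
  private long totalBytes;

  void put(Object key, byte[] serializedValue, int estimatedKeySize) {
    byte[] old = entries.put(key, serializedValue);
    if (old != null) {
      totalBytes -= old.length;           // an update subtracts the old value's size first
    } else {
      totalBytes += estimatedKeySize;     // the key's size is estimated once, on first insert
    }
    totalBytes += serializedValue.length; // then the new serialized value's size is added
  }

  long sizeInBytes() {
    return totalBytes;
  }
}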
When a bucket's size exceeds the "maximum bucket size", it is split
in two based on the extensible hashing algorithm: a new Bucket is
created and populated with the key/value pairs that match its mask,
the Bucket's local depth is incremented by 1, and the global depth is
updated if the new local depth exceeds the current global depth. The
splitting process is repeated while all of the following conditions
are met: the size of either bucket continues to exceed the "maximum
bucket size", the full bucket has more than 1 element, and the global
depth is less than the "maximum global depth".
Primary Bucket
One bucket instance is selected as the primary. All bucket operations
target the primary and are passed on to the backups from the primary.
Identification of the primary is tracked using metadata in the
BucketAdvisor. The following diagram shows the standard state
transitions of the BucketAdvisor:
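Separately from the state diagram, the routing rule described above (all operations go through the primary, which passes them on to the backups) might be sketched as follows; the member identifiers and messaging helpers are hypothetical:

import java.util.List;

class PrimaryBucketSketch {
  private final String primaryMember;
  private final List<String> backupMembers;

  PrimaryBucketSketch(String primaryMember, List<String> backupMembers) {
    this.primaryMember = primaryMember;
    this.backupMembers = backupMembers;
  }

  void put(String self, Object key, Object value) {
    if (!self.equals(primaryMember)) {
      sendTo(primaryMember, key, value);   // non-primaries route the operation to the primary
      return;
    }
    applyLocally(key, value);              // the primary applies the change first
    for (String backup : backupMembers) {
      sendTo(backup, key, value);          // and then passes it on to each backup copy
    }
  }

  private void applyLocally(Object key, Object value) { /* omitted */ }
  private void sendTo(String member, Object key, Object value) { /* omitted */ }
}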
Partitioned Region Cache Listeners
User CacheListeners are registered on the PartitionedRegion. Activity
in the Buckets may fire callbacks on the PartitionedRegion's CacheListeners.
The following figures demonstrate the logic and sequence involved.
Definition of Participants
* pr_A1: a pure accessor
* pr_B1_pri: a datastore which hosts the primary for bucket B1
* pr_B1_c1: a datastore which hosts copy 1 of bucket B1
* pr_B1_c2: a datastore which hosts copy 2 of bucket B1
* pr_A2_listener, pr_A3_bridge, pr_A4_gateway: pure accessors with a CacheListener, Bridge, or Gateway, respectively
Fig. 1 (Flow of a Put to CacheListeners)
* pr_A1 sends putMessage1 to pr_B1_pri, which invokes operateOnPartitionRegion().
* pr_B1_pri synchronizes on the entry and updates it.
* pr_B1_pri distributes via UpdateOperation.distribute; if the operation is on a bucket, adjunct recipients (adjunct.recips) are added.
* pr_B1_pri sends the update to pr_B1_c1 and to pr_B1_c2 (each is processed as shown in Fig. 2) and receives a reply from each.
* If adjunct.recips > 0, pr_B1_pri sends PutMessage2 (notificationOnly == true) to pr_A2_listener, pr_A3_bridge, and pr_A4_gateway; on each of those members the CacheListener fires on the PartitionedRegion iff InterestPolicy != CACHE_CONTENT, and each replies.
* pr_B1_pri waits for the replies from all of the above messages, fires the local CacheListener on the PartitionedRegion, and releases the entry synchronization.
Fig. 2 (Processing of an UpdateOperation by a non-Primary Bucket Host)
* Synchronize on the entry.
* Update the entry.
* The CacheListener fires on the PartitionedRegion iff InterestPolicy != CACHE_CONTENT.
* Release the entry synchronization.
* Reply to the sender.
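The InterestPolicy check that appears in both figures can be summarized by the following sketch; the class is illustrative, not the internal implementation:

class ListenerRuleSketch {
  enum InterestPolicy { ALL, CACHE_CONTENT }

  // Both a non-primary bucket host (Fig. 2) and a notification-only accessor
  // (the PutMessage2 leg of Fig. 1) apply this check before invoking the
  // CacheListeners registered on the PartitionedRegion.
  boolean shouldFireListener(InterestPolicy policy) {
    return policy != InterestPolicy.CACHE_CONTENT;
  }
}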
Migration
Buckets are "migrated" to other members of the distributed system
to ensure that the contents of a partitioned region are evenly spread
across all members of a distributed system that wish to host
partitioned data. After a bucket is split, a migration operation is
triggered. Migration may also occur when a Cache
exceeds
its maxParitionedData
threshold and when a new member
that can host partitioned data joins the distributed system. Each
member is consulted to determine how much partitioned region data it
is currently hosting and the com.gemstone.gemfire.cache.Cache#getMaxPartitionedData maximum amount of
partitioned region data it can host. The largest bucket hosted by the
VM is migrated to the member with the large percentage of space
available for partitioned data. This ensures that data is spread
evenly among members of the distributed system and that their space
available partitioned region data fills consistently. Migration will
continue until the amount of partitioned data hosted by the member
initiating the migration falls below the average for all members.
When a member that hosts partitions {@linkplain
com.gemstone.gemfire.cache.Cache#close closes} its Cache
,
the partitions are migrated to other hosts.
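A rough sketch of this migration heuristic, assuming hypothetical Member and Bucket types and a simplified notion of hosted and maximum bytes:

import java.util.Comparator;
import java.util.List;

class MigrationSketch {
  static class Member {
    long hostedBytes;   // partitioned region data currently hosted
    long maxBytes;      // cf. Cache#getMaxPartitionedData
    double freeFraction() { return 1.0 - (double) hostedBytes / maxBytes; }
  }

  static class Bucket {
    long sizeInBytes;
  }

  void migrate(Member self, List<Member> others, List<Bucket> localBuckets) {
    long total = self.hostedBytes;
    for (Member m : others) {
      total += m.hostedBytes;
    }
    long average = total / (others.size() + 1);

    // Keep migrating until this member drops below the average for all members.
    while (self.hostedBytes > average && !localBuckets.isEmpty()) {
      Bucket largest = localBuckets.stream()
          .max(Comparator.comparingLong(b -> b.sizeInBytes)).get();
      Member target = others.stream()
          .max(Comparator.comparingDouble(Member::freeFraction)).get();
      moveBucketTo(largest, target);       // placeholder for the actual bucket transfer
      self.hostedBytes -= largest.sizeInBytes;
      target.hostedBytes += largest.sizeInBytes;
      localBuckets.remove(largest);
    }
  }

  private void moveBucketTo(Bucket bucket, Member target) { /* omitted */ }
}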
High Availability
The high availability ({@linkplain
com.gemstone.gemfire.cache.PartitionAttributes#getRedundancy
redundancy}) feature of partitioned regions affects the implementation
in a number of ways. When a bucket is created, the implementation
uses the migration algorithm to determine the location(s) of any
redundant copies of the bucket. A warning is logged if there is not
enough room (or not enough members) to guarantee the redundancy of the
partitioned region. When an entry is put into a redundant partitioned
region, the key/value is distributed to each bucket copy according to
the consistency specified by the region's scope. That is, if the
region is DISTRIBUTED_ACK, the put operation will not return until it
has received an acknowledgment from each bucket. When a get is
performed on a partitioned region and the value is not already in the
partitioned region's local cache, a targeted netSearch is performed.
When there are redundant copies of the region's buckets, the netSearch
chooses one bucket at random from which to fetch the value. If the
bucket does not respond within a given timeout, the process is
repeated on another randomly chosen redundant bucket. If the bucket
has been migrated to another member, the member operating on the
region will re-consult its metadata and retry the operation. When
redundant buckets are migrated from one machine to another, the
implementation is careful to ensure that multiple copies of a bucket
are not hosted by the same member.
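The random-copy selection and timeout retry described above might look roughly like the following sketch; the types and the fetchWithTimeout() helper are placeholders for the actual messaging:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class NetSearchSketch {
  private final Random random = new Random();

  Object netSearch(List<String> bucketCopies, Object key, long timeoutMillis) {
    List<String> remaining = new ArrayList<>(bucketCopies);
    while (!remaining.isEmpty()) {
      // Choose one redundant copy at random and try it.
      String copy = remaining.remove(random.nextInt(remaining.size()));
      Object value = fetchWithTimeout(copy, key, timeoutMillis);
      if (value != null) {
        return value;
      }
      // Timed out: loop and try another randomly chosen copy.
    }
    return null;  // no copy responded; the caller would refresh its metadata and retry
  }

  // Placeholder: send a targeted get to the member and wait up to timeoutMillis.
  private Object fetchWithTimeout(String member, Object key, long timeoutMillis) {
    return null;
  }
}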
System Properties
All of the system properties used by GemFire are discussed
here.