com.gemstone.org.jgroups.protocols.DESIGN
$Id: DESIGN,v 1.2 2005/09/12 13:52:30 belaban Exp $
Design Issues
=============
Correct determination of left members (UNICAST and others)
----------------------------------------------------------
In UNICAST we have a connection hashtable which stores the sender's or
receiver's address and an entry for the connection to/from that
address. Whenever a view change is received, we remove from that
connection table all members that are not in the new view. However,
this introduces the following bug.
Consider the case where we have 2 members in the group: {A, B}. Now C
and D want to join simultaneously. Let's assume C joins first,
then D (JGroups doesn't join more than 1 member at a time). This
will result in view {A, B, C} which is sent to A, B and C (not to D
since D is not a member yet). Let's also assume that there is already
an entry for D in A's connection table: if A is the coordinator, this
entry is the result of D's JOIN request sent to A. Let's assume D's
JOIN request has a seqno=1.
When A receives the new view {A, B, C}, it will remove D's entry in
its connection table BECAUSE IT THINKS THAT D HAS LEFT THE GROUP. Now
D resends its JOIN request (with seqno=2). Since this unicast message
is not tagged as first message, A will reject it. (Note that D does
not retransmit the message with seqno=1 because A ack'ed it. Receivers
ack all messages received, even if the messages are not accepted by
their connection table).
To solve this problem, we have to determine whether D really left the
group, or whether it is just a new member not yet seen by the
group. To do this, we maintain a 'group history' in a bounded cache
which contains the last n members of the group. E.g. when C joins,
the cache is {A, B}. Before A removes D's entry from its connection
table,
it checks whether D actually was a member. Since D is not in {A, B},
its entry will not be removed.
This has been modified (bela dec 23 2002):
Now we do not remove an entry from the connection table in
retransmit(), and only remove it in a VIEW_CHANGE when the diff
between 2 views shows that a member left (see determineLeftMember in
org.jgroups.util.Util). For example: if the first view is {A,B,C} and
the next view is {A,B}, then the left member is C.
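A minimal sketch of this view diff, assuming simple List-based views
(the signature is illustrative, not the exact JGroups one):

  import java.util.ArrayList;
  import java.util.List;

  public class ViewDiff {
      // returns the members present in oldView but missing from newView
      public static <T> List<T> determineLeftMembers(List<T> oldView,
                                                     List<T> newView) {
          List<T> left = new ArrayList<>();
          for (T member : oldView) {
              if (!newView.contains(member))
                  left.add(member); // in the old view, gone now: it left
          }
          return left;
      }
  }

With oldView={A,B,C} and newView={A,B} the result is {C}; a brand-new
joiner such as D never shows up as a left member, because it was not
in the old view.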
The use case is when 2 or more members merge into a single group,
e.g. A and B. Let's say A is the merge coord. At this point A needs to
send a unicast message to B, which is not a member of A's view. Under
the old scheme, whenever A or B received a view change or
retransmission request, each one would remove the other's connection
entry from the connection table, therefore preventing the merge
protocol from completing.
Setting of recv/send buffer sizes in UDP (FRAG problem)
-------------------------------------------------------
There was a bug in FRAG which caused receivers of fragmented larger
messages to get only the first fragment of each message, and no
further fragments. The reason was that the default send buffer size of
UDP was 8195: when a larger message (e.g. 30000 bytes) was to be sent,
it was fragmented into 4 packets (3 * 8195 and 1 * 5415 bytes). All 4
fragments were then passed down to UDP almost simultaneously, which
caused the UDP send buffer to overflow, resulting in fragments 2, 3
and 4 being discarded. This explains why retransmitted fragments were
always received correctly. Also, if there was a small timeout
(e.g. 5ms) between the fragments passed down to UDP, *all* the
fragments would be received.
The solution is therefore to set the mcast_send_buf_size,
mcast_recv_buf_size, ucast_send_buf_size and ucast_recv_buf_size in
UDP explicitly, based on the frag_size value of FRAG.
Note that the above problem did not cause messages to be dropped, but
added a number of spurious retransmissions.
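As a rough sketch, the explicit buffer sizing could look like this
with plain java.net sockets (port 7500 and the frag_size-based
multiplier are illustrative assumptions; JGroups takes the real values
from the protocol properties):

  import java.io.IOException;
  import java.net.DatagramSocket;
  import java.net.MulticastSocket;

  public class BufferSizing {
      public static void main(String[] args) throws IOException {
          int fragSize = 8192;          // assumed FRAG frag_size
          int bufSize  = fragSize * 8;  // room for several in-flight fragments

          MulticastSocket mcastSock = new MulticastSocket(7500);
          mcastSock.setSendBufferSize(bufSize);    // mcast_send_buf_size
          mcastSock.setReceiveBufferSize(bufSize); // mcast_recv_buf_size

          DatagramSocket ucastSock = new DatagramSocket();
          ucastSock.setSendBufferSize(bufSize);    // ucast_send_buf_size
          ucastSock.setReceiveBufferSize(bufSize); // ucast_recv_buf_size
      }
  }

Note that these calls are only hints: the OS may clamp the values, so
the effective sizes should be checked with getSendBufferSize() and
getReceiveBufferSize().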
Use of multiple sockets in UDP
------------------------------
There are currently 3 sockets in UDP:
- a MulticastSocket for sending and receiving multicast packets
- a DatagramSocket for sending unicast packets and
- a DatagramSocket for receiving unicast packets
Normally there would only be 2 sockets: a socket for sending and
receiving unicast packets and a socket for sending and receiving
multicast packets. The reason for the additional unicast send socket
is that under Linux, when a packet is sent to a non-existent
(e.g. crashed) receiver, there will be an asynchronous ICMP
HOST-UNREACHABLE sent back to the sender. However, since it is
asynchronous, it will not cause an exception in the current send(),
but in *one of the following ones*! Therefore, we need to close and
re-open the send socket every time.
This feature is due to the fact that the older JDK for Linux from
Blackdown did *not* specify BSD compatibility when creating
DatagramSockets, which meant that ICMP HOST-UNREACHABLE errors were
actually propagated up to the interface (send()), and not just
discarded, as would be expected.
Newer JDKs (e.g. SUN's JDK 1.2.2 and 1.3 for Linux) don't seem to have
this feature (e.g. they seem to have BSD_COMPAT set to true). An item
on the todo list is to remove the code which closes and reopens the
send socket (search for is_linux in UDP.java) and test. This way, we
would have only 2 sockets and wouldn't need to close/reopen the socket
on every unicast send.
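A minimal sketch of the workaround, assuming plain java.net sockets
(the method name is illustrative): by recreating the send socket for
every send, a deferred ICMP error can never surface on a later,
unrelated send().

  import java.io.IOException;
  import java.net.DatagramPacket;
  import java.net.DatagramSocket;

  public class SendSocketWorkaround {
      // Recreate the unicast send socket for every send, so an
      // asynchronous ICMP HOST-UNREACHABLE triggered by a previous
      // send cannot cause an exception on this one.
      static void sendUnicast(DatagramPacket packet) throws IOException {
          DatagramSocket sendSock = new DatagramSocket(); // ephemeral port
          try {
              sendSock.send(packet);
          } finally {
              sendSock.close();
          }
      }
  }

Removing exactly this per-send churn is the todo item above.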
Why not merge all sockets into one, e.g. a MulticastSocket, for
sending/receiving unicast as well as multicast packets?
--------------------------------------------------------
All MulticastSockets need to use the same IPMCAST address and port to
be able to receive multicast packets. However, in JGroups, every
member also needs to have its own address/port to receive unicast
packets. In this case, this wouldn't be true, because the
MulticastSocket only has one port, and that would be the IPMCAST
port. Therefore, returning a unicast response to a multicast request
would be impossible because senders cannot be discerned (they all have
the same port). Example: sock=new MulticastSocket(7500). In
this case, sock.getLocalPort() returns 7500, which is true for *every
member in the group*.
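A quick illustration of the port collision, sketched with plain
java.net sockets:

  import java.io.IOException;
  import java.net.DatagramSocket;
  import java.net.MulticastSocket;

  public class PortDemo {
      public static void main(String[] args) throws IOException {
          // every member binding the group port reports the same port
          MulticastSocket mcastSock = new MulticastSocket(7500);
          System.out.println(mcastSock.getLocalPort()); // 7500 everywhere

          // an unbound DatagramSocket picks a distinct ephemeral port
          DatagramSocket ucastSock = new DatagramSocket();
          System.out.println(ucastSock.getLocalPort()); // e.g. 41873
      }
  }

This is why each member keeps its own unicast socket: the ephemeral
port is what makes a sender's address unique.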
Update (bela Sept 12 2006): even if we don't choose the mcast socket's port
(new MulticastSocket()), when sending on multiple
interfaces (send_on_all_interfaces="true"), we receive a different sender
address for each message sent on a different NIC. This needs to be handled with
*logical addresses* (see JIRA for 2.3).
Sending of dest_addr and src_addr in Message
--------------------------------------------
There was an idea of removing the src_addr and dest_addr from the
serialized data exchanged between UDP protocols, which would reduce
the size of packets. The idea was that the DatagramPacket itself would
contain the sender's address (src_addr), and the receiver's address
would be the one to which the packet was sent. However, there are a
number of problems.
First, since we cannot have just one socket (see above), a receiver of
a unicast and a multicast packet by the same sender S would see 2
different sender addresses (assuming for now that we have 1 multicast
and 1 unicast socket). Let's assume that the sender's port of the
unicast socket is 5555 and its multicast port 7500. The receiver would
receive a packet from S:5555 and one from S:7500, which are 2
different addresses although coming from the same sender.
Second, a multicast sent to a number of receivers cannot be replied
to: a receiver sees that the sender's port is 7500 and replies to that
port. However, that port is used by the multicast group, therefore the
sender of the multicast will not receive a unicast response.
Third, a member always needs a unique IpAddress as local
address. In the above case, we would always have 2 different
IpAddresses, which causes problems identifying the member. Note that
IpAddresses are used as keys into various structures, e.g. in
NakReceiverWindow.
UDP PacketHandler
-----------------
In the old design, unicast and mcast receiver threads first received
the message, then unserialized it and finally passed it on to the next
higher protocol. However, as unserialization takes a long time, the
thread spent too much time in its unserialization routine when it
should have been receiving messages. Therefore a lot of messages were
dropped (network buffer overflow) and had to be retransmitted.
The new design offers an alternative: the ucast and mcast threads just
receive a DatagramPacket and add it to a queue served by the
PacketHandler thread. Then they return to their task of receiving
messages. The PacketHandler thread continuously dequeues messages,
unserializes them and passes them on to the protocol above.
The advantage of the PacketHandler is that large volumes of messages
are better handled. The disadvantage is that we allocate a new byte
buffer every time a message is received and add it to the queue. In
the old design, there was one buffer for ucast and one for mcast
messages. A receiver thread would receive a message into that buffer
(65K), unserialize from that same buffer, and only after passing the
message up the stack would it return to receiving a new message into
the same buffer. So the old design's advantage was low memory use
(reuse of the same buffer). The new scheme copies the bytes received
(e.g. 425 bytes) into a separate byte buffer and adds that buffer to
the packet handler's input queue, therefore using more memory, but
being faster overall.
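A minimal sketch of this producer/consumer split, assuming plain
java.net sockets and a BlockingQueue; unserialize() and passUp() are
placeholders for the real protocol calls:

  import java.io.IOException;
  import java.net.DatagramPacket;
  import java.net.DatagramSocket;
  import java.util.Arrays;
  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;

  public class PacketHandlerSketch {
      private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
      private final DatagramSocket sock;

      PacketHandlerSketch(DatagramSocket sock) { this.sock = sock; }

      // receiver thread: copy the datagram into its own buffer,
      // enqueue it, and immediately go back to receiving
      void receiveLoop() throws IOException, InterruptedException {
          byte[] buf = new byte[65535];
          DatagramPacket packet = new DatagramPacket(buf, buf.length);
          while (true) {
              sock.receive(packet);
              // copy only the bytes actually received (e.g. 425),
              // not the whole 65K buffer
              queue.put(Arrays.copyOf(packet.getData(), packet.getLength()));
          }
      }

      // PacketHandler thread: dequeue, unserialize, pass up the stack
      void handlerLoop() throws InterruptedException {
          while (true) {
              byte[] data = queue.take();
              Object msg = unserialize(data); // placeholder
              passUp(msg);                    // placeholder
          }
      }

      private Object unserialize(byte[] data) { return data; }
      private void passUp(Object msg) {}
  }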
To enable the PacketHandler, set use_packet_handler to true in UDP.
Storing copies of Messages vs references of Messages
----------------------------------------------------
- Some protocols were changed May 26 2002 (by Bela) to store
references instead of copies of messages in retransmit tables. This
avoids accumulating large amounts of memory for copies that are not
needed. This change was possible by switching from stack-based to
hashmap-based headers in messages. With stack-based headers, storing
a message in a retransmit table by reference rather than by copy
meant that each subsequent retransmission would have pushed
additional headers onto the stack. With the hashmap-based headers,
the header for a given protocol is just overwritten by the new header
from that protocol.
- The caveat is that the buffer (payload), source address and
  destination address must not be changed after the message is stored
  by reference in the xmit table, because the stored message would be
  changed as well.
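A toy illustration of why hashmap-based headers make storing by
reference safe (a plain Map stands in for the actual Message API):

  import java.util.HashMap;
  import java.util.Map;

  public class HeaderMapSketch {
      public static void main(String[] args) {
          Map<String, String> headers = new HashMap<>();
          headers.put("UNICAST", "seqno=5"); // first transmission
          headers.put("UNICAST", "seqno=5"); // retransmission: overwrites
          System.out.println(headers.size()); // always 1 for this protocol
      }
  }

With stack-based headers, each retransmission would have pushed
another copy onto the stack instead.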
GMS-less reliable message transmission
--------------------------------------
Here are some more points to look at for GMS-less message transmission.
Looks like such a stack could be FRAG:UNICAST:NAKACK:STABLE. Let me look
at each protocol in turn.
FRAG:
- No changes; should work out of the box
UNICAST:
- The first message to a receiver is tagged with first=true. This allows
the receiver to determine the starting seqno
- When a member crashes we don't know about it. Therefore we need to
periodically clean the connection table of unused connection entries.
However, this could remove a connection to an idle (not crashed) sender.
In this case, when it sends the next message, and we don't find an
entry, we'll just return an error to the sender, so the seqno agreement
process starts again. This way, UNICAST is reliable in that we don't
lose any messages
- When a receiver crashes, we don't know about it. Therefore, if the
receiver hasn't acked all messages, we will continue retransmitting
those messages until the end of time. To prevent this, we will close a
connection after N unsuccessful retransmissions (without having received
an ack); see the sketch below.
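A minimal sketch of that policy; MAX_RETRANSMITS, the Connection
class and send() are illustrative assumptions, not actual JGroups
names:

  import java.util.HashMap;
  import java.util.Map;

  public class RetransmitLimitSketch {
      static final int MAX_RETRANSMITS = 10;

      static class Connection {
          int unackedRetransmits; // reset to 0 whenever an ack arrives
      }

      final Map<String, Connection> connections = new HashMap<>();

      void retransmit(String peer, byte[] msg) {
          Connection conn = connections.get(peer);
          if (conn == null)
              return;                   // connection already closed
          if (++conn.unackedRetransmits > MAX_RETRANSMITS) {
              connections.remove(peer); // give up: peer presumed crashed
              return;
          }
          send(msg, peer);              // placeholder for the real send
      }

      void send(byte[] msg, String peer) {}
  }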
NAKACK:
- The first seqno from a sender causes the creation of a
NakReceiverWindow for that member (if it doesn't exist already)
- When a sender crashes, and the receiver hasn't received all missing
messages, the receiver will keep requesting retransmissions of the
missing messages from the sender (forever). Therefore, similar to
UNICAST, we purge senders that have not answered our retransmission
requests after N attempts (or after a certain time).
- Periodically we will purge idle connections. These may be either idle
  members or crashed members, but we have no means of telling them
  apart. When an idle member has been purged and sends its next
  message, a new entry will simply be created for it.
STABLE:
- A hard one: how do we know when a message is stable in all receivers,
  so we can remove it from the retransmission table, when we don't have
  membership information? The simplest scheme is probably just to
  periodically purge messages that are older than N seconds (see the
  sketch below). If we accidentally purge a message that has not yet
  been seen by all members (a small probability though), we just send
  an error back to the requester. The requester could then, for
  example, remove the retransmitter's connection, so retransmission
  requests would stop.
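A minimal sketch of the age-based purge; the field and method names
are illustrative, not the actual STABLE implementation:

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  public class AgePurgeSketch {
      static class StoredMessage {
          final byte[] payload;
          final long   storedAt = System.currentTimeMillis();
          StoredMessage(byte[] payload) { this.payload = payload; }
      }

      final Map<Long, StoredMessage> xmitTable = new ConcurrentHashMap<>();

      // called periodically: drop everything older than maxAgeMs
      void purgeOlderThan(long maxAgeMs) {
          long cutoff = System.currentTimeMillis() - maxAgeMs;
          xmitTable.values().removeIf(m -> m.storedAt < cutoff);
      }
  }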