Fragmentos is a fragmentation system allowing to send large datagrams over protocols that support only small datagrams. Compared to traditional fragmentation systems, Fragmentos sends about twice more data while achieving improved delivery rate.
Given two nodes x and y on the internet, if x attempts to send a large
UDP datagram y, it is likely that the datagram will be fragmented, in the
lower IP layer, to a few IP packets. Those IP packets will be reconstructed
somewhere on a router on the way, or possibly at the endpoint y.
Traditional IP Packet fragmentation works by splitting a large IP packet into a few smaller packets. The problem with this method is that the original large packet is less likely to arrive at its destination, as the loss of any of the fragments will lead to the loss of the original large packet.
Thinking about a naive probabilistic model, if an IP packet sent from x to
y arrives at y with probability p, then a datagram which was fragmented
to two IP packets will arrive at probability p^2. If p = 3/4, then
q := p^2 = 9/16, which represents much lower odds for successful
delivery.
This document presents a different fragmentation method which increases the likelihood of successful delivery given the model discussed above. For example, Instead of splitting a large datagram into two packets, we create three shares of the datagram in a way that every two shares are enough to reconstruct the original datagram.
In this case, if x wants to send the datagram to y, x sends all the three
shares of the datagram to y. In the naive model described above where the
odds of successful delivery for one IP packet are p, we obtain that the odds
for successful arrival for the complete datagram are
q := p^3 + 3p^2(1-p). For p = 3/4, this will result in
q = 0.84375. Surprisingly in this case q > p.
Given a data chunk M, we want to split it to 2b-1 fragments so that every b
fragments are enough to reconstruct the original data chunk M.
Using information theoretic considerations we can
conclude that each share should be of size at least len(M) / b, so that the
reconstruction algorithm gets at least b*(len(M) / b) = b bytes of input.
To do this we will use the Reed-Solomon erasure
code.
Reed-Solomon erasure code allows us to build the following model:
Assume d shares of data of length l each. Reed-Solomon erasure code can
create additional e shares of data of length l each, such that if any e
shares of data are erased, it is possible to recover them. In other words, any
d shares of data are enough to reconstruct the rest of the shares.
The rust crate
reed-solomon-erasure
contains implementation of Reed-Solomon erasure code that allows to choose any
d and e we desire, as long as d + e <= 256. This is due to the limitation
of the order underlying field GF(256).
Given a data chunk M, we pick d = b and e = b - 1. We first divide it
into b fragments of equal lengths. (If this is not possible, we will need to
add some trailing padding bytes to M to make sure len(M) is divisible by b).
We then have b fragments of M. We create extra e = b-1 fragments using
the Reed-Solomon erasure code. We then end up with b + (b-1) = 2b-1
fragments. Every b fragments out of those 2b-1
We have b + (b-1) = 2b-1 fragments, where every b fragments allow to
reconstruct the rest of the b - 1 fragments, obtaining the original messages.
A message M from is sent from x to y by transmitting a few Fragmentos
messages. Each Fragmentos message is of the following form:
- messageId [8 bytes]
- b [1 byte]
- shareIndex [1 byte]
- shareData [variable amount of bytes]
- shortHash [8 bytes] (First 8 bytes of Sha512/256)
The actual data that we split to data shares is:
T := nonce8 || paddingCount || M || padding
Where || means concatenation. nonce8 is of size 8 bytes, paddingCount
is of size 1 byte.
The value paddingCount := (b - ((8 + 1 + len(M)) % b)) % b is the amount of 0
padding bytes at the end of the message M. Recall that we need padding bytes
at the end of M to make sure that the data we split to data shares, T, is
of size divisible by b. We use the value of paddingCount to be able to cut
the padding bytes at the receiving side.
nonce8 is a random 8 bytes nonce that is generated at the time of sending the
message M. It is used as means for the receiver end to distinguish between
identical messages sent during the same short period of time. For two identical
messages, different nonce8 values will generate different messageId-s.
messageId is obtained by calculating sha512/256 over T, taking only the first
8 bytes of the result.
b represents the amount of shares that are required to reconstruct the
original message. It is necessary that b \in [1, 128]. Any other value of b
is illegal. A message with illegal b value must be ignored.
shareIndex is the index of the current share. There are 2b - 1 shares,
therefore 0 <= shareIndex < 2b - 1. Any other value of shareIndex is
illegal. A message with illegal shareIndex value must be discarded.
shareData is the actual data of this share. Its length could be deduced by
subtracting the total size of all the other fields from the total size of the
datagram. It contains a share of the data T.
shortHash is the first 8 bytes of Sha512/256 over all the previous fields
in the message. It is used to verify the integrity of the message. If
shortHash is invalid, the fragmentos message is discarded.
We assume that we use a protocol where we can not send datagrams over n bytes.
We begin by calculating the maximum size of datagram we can send using the
Fragmentos system over the given protocol.
We can split a message to at most 256 shares, a restriction imposed by the
size of the field GF(256). Therefore 2b - 1 <= 256 and b < 128.
Every message share resides inside a Fragmentos message. The rest of the fields
in a Fragmentos message (messageId, b, shareIndex, shortHash) take 8 + 1 + 1 + 8 = 18 bytes. Hence we are left with n - 18 bytes for shareData.
Therefore we should be able to send a T data message of size at most
128*(n-18). Recall that T contains nonce8, paddingCount which take
together 8 + 1 = 9 bytes. This leaves d(n) := 128*(n-18) - 9 bytes for M.
If we use UDP as the underlying protocol for sending datagrams, we can pick as
an example the value n=512 as a safe size for a UDP packet. By safe we mean
that it is unlikely for the UDP packet to be fragmented by the lower IP layer.
As a result, we obtain d(n) = 128*(512 - 18) - 9 = 63223 bytes (close to
64KB) as the maximum possible Fragmentos datagram.
We assume the value n for the maximum size for a safe datagram in the
underlying datagram protocol.
Algorithm for sending a message M:
-
If
len(M) > d(n), the message is too large. We abort. -
Calculate
b := \lceil\frac{8 + 1 + len(M)}{n - 18}\rceil. -
Construct
T := nonce8 || paddingCount || M || padding.paddingshould contain enough\x00bytes so thatTwill have length that is divisible byb.paddingCountis set to the amount of padding bytes. -
Split
Tto2b-1data shares. -
For each data share of
T, create a Fragmentos message and send it to the destination. The message's fields will be filled as follows:messageId = sha512/256(T)[0:8]bshareIndexis the data share number.shareDatais the data of the share.shortHash = sha512/256(messageId || b || shareIndex || shareData)[0:8]
-
usedMessageIds: A set of messageIds that were recently processed. A messageId has to stay in this list for at leastPROCESSED_TIMEOUTseconds. Then it may be removed. -
curMessages: A dictionary for currently processed messages, having messageIds as keys. Every entry contains:- The value
b. shareLengh: length (in bytes) of a share data. (All shares should have exactly the same amount of bytes).- A set of the received data shares:
(shareIndex, shareData)
- The value
-
Make sure that:
shortHash = sha512/256(messageId || b || shareIndex || shareData)[0:8]. Otherwise, discard the message. -
If
F.messageIdis inusedMessageIds,usedMessageIds[F.massageId]is refreshed, so that it will stay anotherPROCESSED_TIMEOUTseconds before being cleaned up. The message is discarded. -
If there exists an entry
entrywith the keyF.messageIdinsidecurMessagesbutentry.b != F.borentry.shareLength != len(F.shareData), discard the message. -
If there is no entry with the key
F.messageIdinsidecurMessages, create a new entry withmessageId = F.messageId,b = F.bandshareLength = len(F.shareData). -
Given that the message was not discarded yet, denote by
entrythe relevant entry fromcurMessagesrelevant for our message. -
If
F.shareIndexis already present inentry, discard the message. -
Insert
(F.shareIndex, F.shareData)as a new data share toentry. -
If the amount of shares in the entry is less than
b, return. -
Add
messageIdtousedMessageIds. -
Remove
entryfromcurMessages. -
Reconstruct the shares to obtain
T := nonce8 || paddingCount || M || padding. -
If
sha256(T)[0:8] != entry.messageId: remove the entry fromcurMessagesand return. -
Extract
MfromTand return it as a received message.
-
If an
entryincurMessagesis older thanPROCESSED_TIMEOUTseconds, remove it and addentry.messageIdtousedMessageIds. -
If a
messageIdinusedMessagesIdsis older thanPROCESSED_TIMEOUTseconds, remove it.