[e2e] Re: [Tsvwg] Really End-to-end or CRC vs everything else?

Mon Jun 11 22:38:55 PDT 2001

Christian,

That is an interesting point of view. You are implicitly assuming that
applications using the transport are bounded (like the backup) and that a
"total checksum" can be applied to them.

What about those that don't fit this model?

What about continuously running applications?

Should your multinational bank run its applications with checksums on every
transaction?  How will they react to failures?
Reinvent a new recovery mechanism for every application? (In the ips WG we
ran through exactly this, and it was/is a painful experience.)

I think that the implications of an "imperfect" transport (in the error
detection sense) are more extensive than applying another checksum, and the
cost to transport users (complexity, footprint, performance) is excessive.

For most of the data processing applications I think that transport users
would welcome a more resilient error detection mechanism built into the
transport.

Julo

"Christian Huitema" <huitema at windows.microsoft.com> on 12-06-2001 04:57:35

Please respond to "Christian Huitema" <huitema at windows.microsoft.com>

To:   "Douglas Otis" <dotis at sanlight.net>, "Jonathan Stone"
      <jonathan at DSG.Stanford.EDU>, "Craig Partridge" <craig at aland.bbn.com>,
      "David P. Reed" <dpreed at reed.com>
cc:   tsvwg at ietf.org, "end2end" <end2end-interest at postel.org>
Subject:  RE: [e2e] Re: [Tsvwg] Really End-to-end or CRC vs everything
      else?

What is at stake here is whether we want a safety belt or an alarm bell.
The current TCP checksum is the latter. It will detect "most" errors
with a good probability, which is enough to detect a faulty board. It
will not provide sufficient protection for guaranteeing a complete
absence of glitches in the 5 terabyte backup of a multinational
commercial bank. Note that, if you really follow the e2e argument, this
is OK: the only way to be certain that the backup went well is by
computing a strong checksum over the whole volume, not by trusting TCP.
In fact, it is all a matter of probabilities and trade-offs. The
transmission system should be good enough that the backup is OK most of
the time, so that the e2e checksum (volume) only fails rarely, so that
the cost of correction is acceptable...
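
The whole-volume check described above can be sketched in a few lines. This
is only an illustration: SHA-256 is an assumed choice of "strong checksum"
(the message does not name one), and the function names and the 1 MiB chunk
size are hypothetical.

```python
import hashlib


def volume_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a multi-terabyte volume never has to
        # fit in memory; iter() stops at the empty read that marks EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def backup_is_intact(source_path, copy_path):
    """Compare digests computed independently over the source and the copy."""
    return volume_digest(source_path) == volume_digest(copy_path)
```

In a real backup the two digests would be computed on different machines and
only the digest would cross the network; a mismatch then triggers whatever
correction the application can afford.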

-- Christian Huitema

> -----Original Message-----
> From: Douglas Otis [mailto:dotis at sanlight.net]
> Sent: Monday, June 11, 2001 4:41 PM
> To: Jonathan Stone; Craig Partridge; David P. Reed
> Cc: tsvwg at ietf.org; end2end
> Subject: RE: [e2e] Re: [Tsvwg] Really End-to-end or CRC vs everything
> else?
>
> Jonathan,
>
> You can't be suggesting a simple summation is worth using in the face of
> router memory errors.  You have detected these in the wild and noted their
> positions within a 32-bit word.  You have statistics indicating that some
> bits are in error more often than others, suggesting there may be a weak
> bus driver being seen.  From the simple tests that I have run, a simple
> summation is extremely weak in this area.  A CRC still does well even when
> the entire packet is corrupted, including the CRC itself.  There is no need
> for the CRC to be affected to improve the performance of the algorithm.  I
> can say that Fletcher-16 2^n should be avoided altogether due to this
> extremely weak memory bus performance.  It is nowhere near 2^32 in
> performance.  It is closer to 2^6.  The only reason for placing a mandatory
> chunk at the end of the packet would be to ensure against truncation, which
> a good check should catch.  If there is only one checksum type allowed,
> then placing this at the end of the packet has the advantage of minimizing
> the potential passes this packet needs in preparation.  Adler-32 suffers
> the Fletcher problem if the packet is small or mostly zero.
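
One way to see the small-packet effect mentioned above is to look at the two
16-bit running sums inside Adler-32. The sketch below uses Python's
`zlib.adler32`; the helper names are hypothetical, and reading "the Fletcher
problem" as "the sums never reach the modulus" is an assumption about what
was meant.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16, the Adler-32 modulus


def adler_halves(data):
    """Split zlib's Adler-32 value into its two 16-bit running sums (A, B)."""
    value = zlib.adler32(data)
    return value & 0xFFFF, value >> 16


def max_low_sum(length):
    """Largest value A can take for `length` bytes: 1 + 255 * length."""
    return 1 + 255 * length


if __name__ == "__main__":
    for packet in (b"\x00" * 16, b"\xff" * 16):
        a, b = adler_halves(packet)
        print(f"{len(packet)} bytes of {packet[:1]!r}: A={a} B={b}")
```

For a 16-byte packet A can be at most 4081, so it never uses the top bits of
its 16-bit half, and it cannot reach the modulus at all until the packet is
at least 257 bytes long. A mostly-zero packet keeps both sums small in the
same way.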
>
> Doug
>
>
> > To: Craig Partridge; David P. Reed
> > Cc: tsvwg at ietf.org; end2end
> > Subject: Re: [e2e] Re: [Tsvwg] Really End-to-end or CRC vs everything
> > else?
> >
> > In message <200106112048.f5BKmpF07926 at aland.bbn.com> Craig Partridge
> > writes
> > >
> > >In message <5.1.0.14.2.20010611143202.0462bec0 at mail.reed.com>,
> > >"David P. Reed" writes:
> >
> >
> >
> > >I think you've missed the point.  In a prior note, you suggested a line
> > >of thinking that assumes an adversary.  Implicitly, that's an error
> > >model.
> > >
> > >So if traffic doesn't match that error model -- that is to say, errors
> > >are not ones an adversary would pick -- then the checksum chosen is the
> > >wrong one.
> >
> > Craig,
> >
> > To be fair, there are several points kicking around here, and being
> > made to (or intended for) different audiences.  The tsvwg folks are
> > under time pressure to (re-)decide on a check function.  The e2e
> > folks can take a longer, or at least a middle-vision, view.
> >
> > The way I'd put it to the tsvwg folks choosing a checksum is this:
> > if you start appealing to catching all but 1 in 2^32 errors (or
> > more accurately, 1 in 65521^2), then you have fallen into a fallacy.
> >
> > You have just conflated a purely combinatoric result, about the ratio
> > of sizes of the domain and range of the error-check function, with a
> > probabilistic statement about how likely *in practice* you are to
> > catch errors.  What should give you a serious wake-up call is to hear
> > that even the constant function -- some constant 32-bit integer -- will
> > catch the same fraction of all errors.  There are no grounds for
> > labelling *any* function (in the mathematical sense) as stronger than
> > another, unless we know something about the distribution of the
> > errors, or how well some particular function does against some
> > particular distribution.  CRCs are not any stronger than
> > checksums, *unless* we happen to know that the distribution of
> > actual errors tends to favour low Hamming-weight errors.
> > (The data Craig and I have says that it doesn't.)
> >
> >
> > The point to Dave Reed is that the combinatoric argument is very
> > general and applies to any function, whether the constant function, or
> > a cryptographic hash, or a shared-secret key, provided we define
> > "error check bits" to properly measure the fraction of bit
> > combinations which are accepted, versus those which are
> > rejected.
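
The combinatoric argument can be checked by brute force on a toy scale. The
sketch below is an illustration only: the 8-bit message, 4-bit check field
and the three check functions are all hypothetical choices made to keep the
enumeration small. It counts, over every possible received (message, check)
pair other than the one sent, how many the receiver would accept.

```python
from fractions import Fraction

MSG_BITS = 8    # toy message size (assumption, keeps the enumeration small)
CHECK_BITS = 4  # toy check-field size (assumption)


def constant_check(msg):
    """The constant function: every message gets the same check value."""
    return 0


def sum_check(msg):
    """Add the two 4-bit halves of the message, truncated to 4 bits."""
    return ((msg >> 4) + (msg & 0xF)) & 0xF


def xor_check(msg):
    """XOR the two 4-bit halves of the message."""
    return (msg >> 4) ^ (msg & 0xF)


def undetected_fraction(check, sent_msg):
    """Fraction of all wrong (message, check) pairs the receiver accepts."""
    sent = (sent_msg, check(sent_msg))
    undetected = total = 0
    for msg in range(1 << MSG_BITS):
        for chk in range(1 << CHECK_BITS):
            if (msg, chk) == sent:
                continue  # the pair arrived unchanged, so it is not an error
            total += 1
            if check(msg) == chk:
                undetected += 1
    return Fraction(undetected, total)


if __name__ == "__main__":
    for f in (constant_check, sum_check, xor_check):
        print(f.__name__, undetected_fraction(f, sent_msg=0x5A))
```

All three functions give the same fraction, because each received message
has exactly one check value that is accepted. The count says nothing about
which errors actually occur, which is the point being made above.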
> >
> >
> > One further thing I've mentioned in email is that I recently
> > re-analyzed the captured error datasets which I and Craig and Vern
> > gathered, and I did find one pattern which could be exploited here.
> >
> > The pattern is that the errors we found can be broadly characterized
> > into two classes: either single-bit or short, low-Hamming-weight
> > errors; or errors where some prefix of the packet is intact, the
> > packet is then subjected to an error, and the error continues all the
> > way to the end of the packet. The ratio of errored bits within that
> > damaged `tail' of the packet is very close to 0.5.
> >
> > That suggests an error model where we model packet-level errors as due
> > to single-bit errors; memory-readout errors which affect a
> > single word or cache line; or `stateful' errors in the
> > hardware/software finite-state engines which move packets between
> > packet-buffer memory and the hardware which implements some specific
> > media layer. (Think of errors due to an under-run in a hardware FIFO,
> > or a bad bit in a DMA pointer register.)
> >
> > There are two things to take away from that.  The first is that the
> > errors we've actually observed, in the only study of in-the-wild
> > packet-level errors I know of, are so heavy that, on
> > average, they affect more than R bits, for any R that's a plausible
> > error-check length. That says we're only going to catch errors
> > stochastically.
> > The second is that, since the errors seem to be stateful, putting the
> > error-check information at the end of the packet rather than in a
> > fixed header field doesn't hurt, and (for the reasons we analyzed to
> > death in our 98 ToN paper) will actually help, for the kinds of
> > nonuniform data we find in filesystems.
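
The `damaged tail' error class can be simulated roughly as follows. This is
a sketch under stated assumptions, not the analysis from the datasets: the
tail is replaced with uniformly random bytes (so each tail bit is wrong with
probability 0.5), the check is a CRC-32 from Python's `zlib` placed in a
4-byte trailer as a stand-in for whatever check is chosen, and all function
names, packet sizes and the seed are hypothetical.

```python
import random
import zlib


def corrupt_tail(packet, rng):
    """Overwrite the packet from a random offset to the end with random bytes."""
    start = rng.randrange(len(packet))
    tail = bytes(rng.getrandbits(8) for _ in range(len(packet) - start))
    return packet[:start] + tail


def with_crc_trailer(payload):
    """Append the CRC-32 of the payload as a 4-byte big-endian trailer."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def trailer_ok(packet):
    """Recompute the CRC-32 over the payload and compare it to the trailer."""
    payload, trailer = packet[:-4], packet[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == trailer


def undetected_count(trials, payload_len=256, seed=1):
    """Count damaged packets that still pass the trailer check."""
    rng = random.Random(seed)
    missed = 0
    for _ in range(trials):
        payload = bytes(rng.getrandbits(8) for _ in range(payload_len))
        sent = with_crc_trailer(payload)
        received = corrupt_tail(sent, rng)
        # A tail that happens to equal the original is not an error.
        if received != sent and trailer_ok(received):
            missed += 1
    return missed


if __name__ == "__main__":
    print("undetected:", undetected_count(2000))
```

Under this model a trailer that is overwritten along with part of the
payload is itself random, so a damaged packet is accepted only by chance
(about 2^-32 per packet for a 32-bit field). That is detection in the
stochastic sense described above, not a guarantee.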
> >
> > If there's anything I can recommend to the tsvwg, it's to pick even a
> > 32-bit extension of the TCP checksum, rather than Adler32; and to
> > think seriously about moving the error-check bits to the end.  Not to
> > help hardware, but to make whatever error-check you use more resilient
> > against errors in packet-processing engines which (once an error does
> > hit) trash the remainder of the packet.
> >
> >