[e2e] latest spate of cruft postings to e2e
Vernon Schryver
vjs at calcite.rhyolite.com
Thu Nov 6 21:09:30 PST 2003
> To: Vernon Schryver <vjs at calcite.rhyolite.com>, end2end-interest at postel.org
> From: "David P. Reed" <dpreed at reed.com>
> <html>
> <body>
> At 04:40 PM 11/6/2003, Vernon Schryver wrote:<br>
> <blockquote type=cite class=cite cite> - keyword and other scoring
> filters including so called "Bayesian"<br>
> systems<br>
> Except for some individuals and for them
> only some of the time,<br>
> these have non-trivial false positive
> rates.</blockquote><br>
> I use an excellent open-source Bayesian filter, called POPFile (see
> sourceforge). It's long-term accuracy for classifying
> messages into 25 personally invented buckets (including e2e messages) is
> displayed as follows:
> <dl>
> <dd><h2><b>Classification Accuracy</b></h2>
> <dd>Messages classified: 75,001
> <dd>Classification errors: 301<hr>
>
> <dd>Accuracy: 99.59%<br>
>
> <dd>Bucket Classification Count False Positives False
> Negatives
> <dd>...
> <dd><font color="#FF0000">spam</font>
> <x-tab> </x-tab>51,953 (69.26%)
> <x-tab> </x-tab>205
> <x-tab> </x-tab><x-tab> </x-tab>81
> <dd>...
> </dl>It takes me about 15 seconds to scan a folder of 100 messages that
> are classed as spam to detect these false positives, and the false
> negatives are of course less of a problem. A 0.2% false
> positive rate is quite reasonable. Note that I have deliberately
> resisted using POPFile's whitelist capability - I ONLY use the Bayesian
> learning filter.<br><br>
> The advantage, of course, is that what I consider to be spam is a purely
> personal decision, which is Joe Touch's point - it's a very bad idea to
> impose a notion like "solicitation" as a criterion for
> rejecting stuff. Email is by definition unsolicited, in
> almost all instances. The Nobel Prize phone call is equally
> unsolicited. Perhaps you don't want to get it, but I'd prefer
> to have the choice to be given my Nobel, thank you.<br><br>
> </body>
> </html>
As far as I can tell from my manual decryption of that missive, it:
- misrepresents my position.
- makes the incredible claim that all mail is unsolicited.
Some mail is unsolicited but a lot of mail is solicited by any
common definition of the word.
- No informed person considers all unsolicited mail to be spam for
most people. That notion is generally the domain of kooky spammer
fighters. The consensus definition of spam is unsolicited bulk email.
- of course spam is a personal matter. That's the point of and
a major implication of "solicited" in the consensus definition.
The "person" involved here is the charter of the list represented
by the human running the list. I've tried to unsubscribe from
this list because Joe's personal notion of spam differs from mine.
- I do not believe a 0.2% false positive rate or anything less than
a few % over the long haul for a Bayesian filter. I've investigated
more than one or two such claims about Bayesian filters. They
have all turned out to carry caveats like "but of course that was
after I trained it for 3 months and doesn't count the mail I look
at to update the training." Any mail your filter requires you to
examine, no matter in which "bin" or "folder," is either not really
filtered or must be counted as false positives or negatives.
- False positive rates of less than 0.1% are humanly impossible for
spam loads above a gross or two spam/day for manual examination
even if you spend 1.5 minutes/100 spam instead of 15 seconds.
Unless your job consists entirely of reading your spam load, you
will miss some legitimate mail among 100s of spam per day.
- Some of those Nobel messages are unsolicited, but the majority are
in fact solicited by the English meaning of the word. If you have
any hope of hearing from the King of Norway and if you really think
the messages would be substantially identical to a lot of other
messages, then you ought to do some whitelisting to document your
preference/solicitation even if you use a Bayesian system. It is
humanly impossible to filter a Nobel invitation from among 100
messages in only 1.5 seconds/message with better than 99% accuracy.
- That missive above misstates the situation with this mailing list.
There's no telling what messages Joe Touch's filters are rejecting,
but he has made clear that he is rejecting some. The spam that
passes his filters is much less than you would expect given the
wide distribution of the submission address. Joe has probably
already caused you to miss hearing from whichever Nobel Committee
tried to contact you via the E2E list. (That's sarcasm.) (Saying
it's sarcasm is intended to be insulting.)
- HTML mail is the single biggest enabler of spam on the net. Everyone
who sends HTML mail to strangers should lose the privilege of
sending mail to strangers for one day for each unjustified HTML
mail message sent to a stranger. Everyone responsible for making
HTML the default configuration of an MUA should be forced to receive
only spam for the rest of the decade.
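As a rough check on the percentages argued over above, the following Python sketch recomputes them from the POPFile figures quoted in the HTML message. It is only an illustration: the variable names are invented here, and it assumes that "Classification Count" is the number of messages placed in the spam bucket, that the 205 false positives are legitimate messages filed as spam, and that the 81 false negatives are spam filed elsewhere. The quoted message does not say which denominator its 0.2% figure uses, so three candidates are shown.

```python
# Figures copied from the quoted POPFile statistics.
classified = 75_001        # all messages classified
errors = 301               # all classification errors
spam_bucket = 51_953       # messages filed as spam (assumed meaning)
false_positives = 205      # legitimate mail filed as spam (assumed meaning)
false_negatives = 81       # spam filed elsewhere (assumed meaning)


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Under the assumptions above, recover how much mail was really legitimate.
actual_spam = spam_bucket - false_positives + false_negatives
actual_legitimate = classified - actual_spam

rates = {
    "overall accuracy": percent(classified - errors, classified),
    "false positives / all mail": percent(false_positives, classified),
    "false positives / spam bucket": percent(false_positives, spam_bucket),
    "false positives / legitimate mail": percent(false_positives, actual_legitimate),
}

# Scanning speeds mentioned in the thread, in seconds per message.
seconds_per_message = {
    "15 seconds per 100": 15 / 100,
    "1.5 minutes per 100": 90 / 100,
}

for label, value in rates.items():
    print(f"{label}: {value:.2f}%")
for label, value in seconds_per_message.items():
    print(f"{label}: {value:.2f} s/message")
```

Under these assumptions the false positive figure is roughly 0.27% of all mail, 0.39% of the spam bucket, or 0.88% of legitimate mail, so the choice of denominator matters when comparing it with the thresholds discussed above.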
As I said, I've unsubscribed from the list. I've no interest in
receiving uninformed and unthinking commentary on spam. I've been
playing the anti-spam game since before Spamford. My anti-spam system
handled more than 63 million mail messages in the 24 hours ending
midnight GMT. That this list continues to distribute so much spam
while other equally open mailing lists don't says all that needs to
be said about Joe's understanding of spam. That
David Reed claims all mail is unsolicited implies at best that he and
I don't share a common language.
I also have no interest in attempts to elevate what was an excellent
engineering insight into a religion complete with an elder priesthood
that makes inscrutable oracular pronouncements on what the dogma really
means that would have done the priestesses of Delphi proud.
Vernon Schryver vjs at rhyolite.com
P.S. If Joe would do what's reasonable and very common, namely
manually examine and filter submissions from non-subscribers,
only David Reed and he would see this diatribe.