recognizing what you've read before
# Perhaps the EFF people would like to include a little header in
# their releases explaining the groups/lists which already
# receive the text automatically and explain the concept of
I've thought about automating this from the user end.
Define some characteristic signature for a paragraph, and some
way to recognize one inside a text file.
Here's my best approach. Only pay attention to the letters and
numbers [A-Za-z0-9]. Treat everything else as white space.
Use some kind of hashing or checksum to digest the body of
a paragraph. Ignoring punctuation and newlines lets you recognize
a paragraph even if it is quoted or re-fmt'ed.
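
A minimal sketch of such a signature in Python. The function name
paragraph_signature and the choice of SHA-256 are assumptions made for
illustration; any hash or checksum would serve.

```python
import hashlib
import re


def paragraph_signature(text: str) -> str:
    """Digest a paragraph, looking only at letters and numbers."""
    # Keep runs of [A-Za-z0-9]; everything else counts as white space.
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Join with single spaces so line breaks and quote marks do not matter.
    normalized = " ".join(words)
    # SHA-256 is an assumed choice of digest, not something the scheme requires.
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()
```

With this normalization, a paragraph and a "> "-quoted, re-wrapped copy
of it produce the same signature.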
Define paragraphs so that two different formats are recognized:
1. Lines with letters, delimited by lines without letters.
That will recognize the format I've used until now,
which I find most readable in email.
2. Lines that are indented more than the previous line
begin new paragraphs. That will recognize the paragraphs from
here on.
3. It would probably also help to recognize some important
things that are not paragraphs of readable text, such as uuencodes
and C source and unreadable PGP blocks.
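
Rules 1 and 2 above could be sketched roughly as follows. The name
split_paragraphs is hypothetical, indentation is counted naively as
leading spaces and tabs, and nothing here tries to detect uuencodes,
C source, or PGP blocks.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split text into paragraphs using the two rules above."""
    paragraphs: List[str] = []
    current: List[str] = []
    prev_indent = None
    for line in text.splitlines():
        if not re.search(r"[A-Za-z]", line):
            # Rule 1: a line without letters delimits paragraphs.
            if current:
                paragraphs.append("\n".join(current))
                current = []
            prev_indent = None
            continue
        # Naive indent measure (assumption): count leading spaces and tabs.
        indent = len(line) - len(line.lstrip(" \t"))
        if current and prev_indent is not None and indent > prev_indent:
            # Rule 2: a line indented more than the previous one
            # begins a new paragraph.
            paragraphs.append("\n".join(current))
            current = []
        current.append(line)
        prev_indent = indent
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs
```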
The idea, of course, is to keep a database of paragraph
signatures that you have seen, along with whether or not you
bothered to read each one. When a new message arrives, it can
be characterized like "18% new, 23% read before, 51% skipped before,
8% not text".
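
A rough sketch of that characterization, assuming the database is just
a dict from signature to "read before" or "skipped before", and that
percentages are taken by paragraph count. The names signature,
characterize, and is_text are hypothetical; is_text stands in for
whatever test decides that a block is not readable text.

```python
import hashlib
import re
from collections import Counter


def signature(paragraph):
    # Same idea as before: digest only the letters and numbers.
    words = re.findall(r"[A-Za-z0-9]+", paragraph)
    return hashlib.sha256(" ".join(words).encode("ascii")).hexdigest()


def characterize(paragraphs, seen, is_text=lambda p: True):
    """Return {label: percent} for a message's paragraphs.

    seen maps signature -> "read before" or "skipped before".
    """
    counts = Counter()
    for p in paragraphs:
        if not is_text(p):
            counts["not text"] += 1
        else:
            # Unknown signatures count as new.
            counts[seen.get(signature(p), "new")] += 1
    total = sum(counts.values()) or 1  # avoid dividing by zero
    return {label: round(100 * n / total) for label, n in counts.items()}
```

For example, with one paragraph already recorded as read and one never
seen, the result is half "read before" and half "new".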
You still have the problem of finding truncated paragraphs
like the one I quoted at the top of this message.
Those could be recognized if you kept signatures of lines instead
of paragraphs. It would take some experimentation to fine-tune.
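
One way the line-based variant might look, as a sketch only. The names
line_key and fraction_seen are hypothetical, and the normalized line
itself is used as the key rather than a hash, to keep the example
short. Note that this only catches quoted text whose line breaks
survive; re-wrapped text would need something else.

```python
import re


def line_key(line):
    # Normalize one line: letters and numbers only, single-spaced.
    return " ".join(re.findall(r"[A-Za-z0-9]+", line))


def fraction_seen(paragraph, seen_lines):
    """Fraction of a paragraph's non-empty lines that were seen before."""
    keys = [k for k in map(line_key, paragraph.splitlines()) if k]
    if not keys:
        return 0.0
    return sum(k in seen_lines for k in keys) / len(keys)
```

A truncated quote of a known paragraph then scores near 1.0 even though
its paragraph-level signature would not match.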
Finally, a mailing list itself could remember what has been
sent on it, and attempt to reject large messages of mostly
redundant paragraphs.
>strick<