Abstract
Email is a widespread and essential means of communication. However, the threat of unsolicited junk email, also known as spam, is becoming increasingly serious. The basic idea of the similarity matching scheme for spam detection is to maintain a database of known spam, built from user feedback, and to use it to block subsequent near-duplicate spam. To achieve efficient similarity matching and reduce storage utilization, prior works mainly represent each email by a succinct abstraction derived from the text of the email content. However, these abstractions cannot fully capture the evolving nature of spam and are thus not effective enough in near-duplicate detection. We propose an email abstraction scheme that uses the email layout structure to represent emails. We present a procedure, SAG (Structure Abstraction Generation), that generates the email abstraction from the HTML content of an email; this newly devised abstraction captures the near-duplicate phenomenon of spam more effectively. Moreover, we design a complete spam detection system that combines an efficient near-duplicate matching scheme with a progressive update scheme. The progressive update scheme enables the system to keep the most up-to-date information for near-duplicate detection.
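
To make the idea of a layout-based abstraction concrete, the following is a minimal sketch and not the SAG procedure itself, whose details are not given here. It assumes that the abstraction is simply the ordered sequence of HTML opening tags with all text and attributes discarded, and that near-duplicate matching is an exact lookup of that sequence in a set of reported spam. The names `TagSequenceParser`, `layout_abstraction`, and `SpamDatabase` are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects opening tag names in document order, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        self.tags.append(tag)


def layout_abstraction(html):
    """Hypothetical abstraction: the email's tag sequence as a hashable tuple."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return tuple(parser.tags)


class SpamDatabase:
    """Assumed store of abstractions for emails that users reported as spam."""

    def __init__(self):
        self.known = set()

    def report_spam(self, html):
        # User feedback adds the layout abstraction of the reported email.
        self.known.add(layout_abstraction(html))

    def is_near_duplicate(self, html):
        # Exact match on layout only; the wording of the email is not compared.
        return layout_abstraction(html) in self.known
```

Under these assumptions, two emails whose wording and link targets differ but whose tag layout is identical map to the same abstraction, so a reported spam email also blocks its reworded variants. An email with a different layout is not matched.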