Email spam is not just a nuisance but a menace, given the growing number of active spammers. Separating spam from legitimate mail is an important data classification problem. However, misclassifying a legitimate email as spam costs more than letting a spam message through. Legitimate mail is commonly called ham, and ham that is wrongly classified as spam is a false positive, the result of improper classification. Hence, the classification threshold is chosen so as to avoid false positives, even if that lets some spam through.
Various features of the mail are used to judge whether it is spam, such as the sender's address (blacklisted or not) and the content of the mail, for example the occurrence of words common in spam such as 'free' and 'buy'.
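To make the threshold idea concrete, here is a minimal sketch in Python. It assumes each mail already has a numeric spam score (higher means more spam-like) and that a mail is flagged when its score reaches the threshold. The function name `pick_threshold` and the parameter `max_legit_flagged` are hypothetical, chosen only for illustration.

```python
def pick_threshold(scores, labels, max_legit_flagged=0.0):
    """Return the lowest threshold that keeps the share of legitimate
    mail flagged as spam at or below max_legit_flagged.

    scores: spam score per mail (assumption: higher = more spam-like)
    labels: True if the mail really is spam, False if it is legitimate
    A mail is flagged as spam when score >= threshold.
    """
    legit_scores = [s for s, is_spam in zip(scores, labels) if not is_spam]
    if not legit_scores:
        raise ValueError("need at least one legitimate mail to tune on")

    # Try thresholds from the lowest upward; infinity flags nothing.
    for threshold in sorted(set(scores)) + [float("inf")]:
        flagged = sum(1 for s in legit_scores if s >= threshold)
        if flagged / len(legit_scores) <= max_legit_flagged:
            return threshold


scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.6]
labels = [False, False, True, True, True, False]
strict = pick_threshold(scores, labels)        # no legitimate mail flagged
relaxed = pick_threshold(scores, labels, 0.34)  # up to ~1 in 3 may be flagged
```

With the strict setting, the spam message scored 0.35 slips through, which is the trade-off described above: some spam is tolerated so that legitimate mail is not lost.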
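As a rough illustration of such features, the sketch below turns a mail into a small feature dictionary. The blacklist entry and the helper name `extract_features` are made up for this example; only the words 'free' and 'buy' come from the text above.

```python
import re

# Hypothetical blacklist and word list, for illustration only.
BLACKLISTED_SENDERS = {"promo@spam.example"}
SPAM_WORDS = {"free", "buy"}


def extract_features(sender, body):
    """Build simple features a spam classifier could consume."""
    words = re.findall(r"[a-z']+", body.lower())
    return {
        "sender_blacklisted": sender.lower() in BLACKLISTED_SENDERS,
        "spam_word_count": sum(1 for w in words if w in SPAM_WORDS),
        "total_words": len(words),
    }


features = extract_features(
    "promo@spam.example",
    "Buy now and get a FREE gift, free shipping!",
)
```

A real filter would use many more features, but the shape is the same: each mail is reduced to measurable signals that a rule set or a statistical model can weigh.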
Email Spam – detection and anti detection methods
Server-side spam filters such as SpamAssassin use rules that classify mail as spam or legitimate based on the occurrence of different types of words and features. It makes use of a neural-network-based method to do the classification. SpamBayes provides tools for desktop email clients such as Outlook, and for mail fetched over POP3 and IMAP (for example from Gmail or Yahoo accounts) in many other popular email clients. It is based on statistical analysis of the email content.
Methods that rely on content-based classification have been very effective, because the spammer has to deliver a message whose content differs from legitimate mail in many ways.
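To show what statistical analysis of email content can look like, here is a minimal naive Bayes sketch in Python. It is a textbook technique used for illustration, not the actual algorithm of SpamBayes or SpamAssassin, and the class name `NaiveBayesSpamFilter` and the training sentences are invented for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSpamFilter:
    """Toy word-frequency classifier; labels are 'spam' and 'legit'."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "legit": Counter()}
        self.mail_counts = {"spam": 0, "legit": 0}

    def train(self, text, label):
        self.mail_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        # Assumes at least one training mail of each label.
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["legit"])
        total_mails = sum(self.mail_counts.values())
        log_scores = {}
        for label in ("spam", "legit"):
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.mail_counts[label] / total_mails)
            for word in tokenize(text):
                if word not in vocab:
                    continue  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
            log_scores[label] = score
        # Normalize the two log scores into a probability of spam.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["legit"])


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("buy free pills now", "spam")
spam_filter.train("free offer buy today", "spam")
spam_filter.train("meeting agenda for monday", "legit")
spam_filter.train("lunch on monday with the team", "legit")

likely_spam = spam_filter.spam_probability("free pills offer")
likely_legit = spam_filter.spam_probability("monday meeting agenda")
```

The returned probability can then be compared against a threshold to decide whether the mail goes to the spam folder.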
The empire fights back
The spam empire has fought back with many changes. Tools such as Spam checker inspect a mail and suggest synonyms or other changes that make it look less like spam and more like legitimate mail. Although the spammer can make such changes, he cannot make the mail ridiculously complex and incomprehensible. New rules on one side and new synonyms to escape them on the other make this an ongoing battle between spammers and spam detectors. Gmail uses the same content-based approach to decide which ads are relevant to a particular mail user.
Replacing the words that might give the spam away with images has been a popular way for spammers to evade such content-based methods. The obvious solution is to recognize the characters and words in the images and check them for spam. This seems to be a never-ending battle.