Writing UserTests for AntiSpam (3)

Harriet Bazley

In this final article I shall give examples of some of the more interesting things one can do when programming user tests. Remember: if you can describe a given type of spam in a logical statement, then you can test for it.

Using header$

Many people will have observed that a certain type of spammer likes to write his subject lines ALL IN CAPITALS, presumably because he hopes to attract the reader’s attention. Although it’s obvious to the human eye, there is no way of testing for this sort of thing using the Rules provided by AntiSpam — not least because all keywords are forced to lower-case before comparing them, to ensure that ‘*debt*’ will catch all of ‘Debt’, ‘debt’ and even ‘dEbT’. However, this is easy enough to do as a user test. AntiSpam even provides a facility almost tailor-made for the purpose, in the form of the additional FNUserTest_DoTest parameter header$.

You may recall that FNUserTest_DoTest has four parameters, only the first two of which are normally used: kw$, data$, header$ and mbox%. mbox% is, of course, the mailbox number — allowing you, for example, to apply a given set of tests only to incoming email from one ISP — but header$ is the complete text of the original header line as received by AntiSpam, keyword, case and everything. You can use this to test for leading spaces at the start of the subject data, normally stripped before data$ is passed, or for telltale malformed headers; but its principal use is for checking capitalisation.

Testing for capitals

Since we’re working from Basic, we are going to let the operating system do all the work for us by making a system call at the start of the program to “Territory_UpperCaseTable”. The Territory module in RISC OS handles such problems as comparing strings containing accented vowels, and Territory_UpperCaseTable is an OS routine which returns the address of a 256-byte table within this module, giving the uppercase equivalent of each of the 256 possible character codes:

            ASCII Data                             
 ................................                  
  !"#$%&'()*+,-./0123456789:;<=>?@
 @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
 `abcdefghijklmnopqrstuvwxyz{|}~
 €‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ 
  ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿À
 ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßà
 àáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ

This allows us to compare the character we are actually looking at with what would be its uppercase equivalent, and hence tell if it is a lowercase letter or not. We assign the address of this table to a global variable user_uppercase%:

DEF PROCUserTest_Initialise
SYS"Territory_UpperCaseTable",-1 TO user_uppercase%
ENDPROC

This can now be accessed from within a home-written procedure, PROCuppercase.

DEF PROCuppercase(name$,RETURN code%,RETURN title$)
LOCAL loop%,uppercount%,char$,letter%
IF LEN(name$)<8 THEN ENDPROC
FOR loop%=1 TO LEN(name$)
   letter%=ASC(MID$(name$,loop%,1))
   IF user_uppercase%?letter% = letter% THEN uppercount%+=1
NEXT
IF uppercount%=LEN(name$) THEN code%=_DELETE%:title$="CAPITALS":UserTestPriority%=user_temporary%
ENDPROC

PROCuppercase simply loops through a string name$ one character at a time, using the Basic MID$ keyword to pick out each character and the ASC keyword to set letter% to its character code, and checks whether the uppercase version of that character is the same as the original in the string. Strings of fewer than eight characters are skipped without being tested at all.

The BASIC expression user_uppercase%?letter% returns the value at the address user_uppercase%+letter% — and since the letters are arranged in ASCII code value order in memory, it will thus return the upper-case equivalent version of the letter in question. If it’s a number or a punctuation character, then the ‘upper-case’ version is the same as the lower-case, and if it’s a capital letter then the upper-case version is obviously the same. Thus, it considers that all the letters in “MR. TAMBO” (including the space and the full stop) are capitals, which is the result we want. On the other hand, if we pass it the string “Harriet Bazley”, it will decide that only three out of the fourteen letters (H, B and space) are capitals; thus uppercount% will be 3 and LEN(name$) will be 14, the values are not equal, and I pass the test!
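
For anyone who would like to try out the table-lookup idea away from AntiSpam, the following is a rough Python sketch of the same test; it is not part of AntiSpam. The names UPPER_TABLE, count_capitals and is_all_capitals are invented for illustration, and the table is an assumption: it folds only the unaccented letters a-z, whereas the real Territory table also deals with accented characters.

```python
# Stand-in for the Territory upper-case table: 256 bytes indexed by
# character code. ASSUMPTION: only a-z are folded to A-Z here; the real
# RISC OS table also folds accented letters.
UPPER_TABLE = bytes(range(256)).upper()


def count_capitals(text: str) -> int:
    """Count characters whose table entry equals the character itself.

    Digits, spaces and punctuation count as 'capitals' in this sense,
    exactly as in the BASIC version. Text must be encodable as Latin-1.
    """
    codes = text.encode("latin-1")
    return sum(1 for code in codes if UPPER_TABLE[code] == code)


def is_all_capitals(text: str, min_length: int = 8) -> bool:
    """Mirror PROCuppercase: skip short strings, then require that every
    character is 'upper-case' in the table sense."""
    if len(text) < min_length:
        return False
    return count_capitals(text) == len(text)
```

With this table, count_capitals("Harriet Bazley") gives 3 and is_all_capitals("MR. TAMBO") is true, matching the worked example above.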

Separating names and addresses

Sometimes I want to do tests on only one half of a From: header (e.g. ‘Ray Dawson’ rather than ‘<ray@raydawson.com>’ or ‘LoveYourPet@asp-platform.com’ rather than ‘“Love Your Pet”’). I use PROCnameaddress to split a From header up into two parts, the ‘name’ and the ‘address’, so that I can test the one without the other. As before, I’m using the Basic RETURN keyword in the procedure parameters because I need to return more than one value.

DEF PROCnameaddress(from$,RETURN nick$, RETURN email$)
REM separate email address from human-readable name (if any)
LOCAL start%,end%
from$=MID$(from$,7)
REM remove ‘FROM: ’

Note that because this procedure is intended always to be called using header$ rather than data$ (in order to get the result as originally received rather than all in lowercase), I’m starting from the seventh letter of the string here without checking to see that the first six letters actually are ‘From: ’. This is not necessarily a good idea....

start%=INSTR(from$,"<")
REM look for component in angle brackets
IF start%>0 THEN
   end%=INSTR(from$,">")
   email$=MID$(from$,start%+1,end%-start%-1)
   nick$=MID$(from$,1,start%-1)+MID$(from$,end%+1)

Unfortunately, there are many different possible formats for an email address. Perhaps the easiest ones to detect are those where the actual address is surrounded in angle brackets; here I’m getting the location of the opening and closing brackets (if any) into the local variables start% and end%, and using those values in MID$ to slice the address out from the string provided. The section before (if any) and the section after (if any) are then spliced together to deliver the other half of the email address, the human-readable ‘nick’. This is done simply in order to avoid having to check whether the section in angle brackets comes at the start or end!

ELSE
   REM if no angle brackets then string must start with email address
   start%=INSTR(from$," ")
   IF start% THEN nick$=MID$(from$,start%+1):email$=LEFT$(from$,start%-1):ELSE nick$="":email$=from$
ENDIF
ENDPROC

For the formats which don’t use angle brackets to demarcate the machine-readable portion of the address, we can assume that it will instead occur at the start of the From line, with the human-readable portion which follows enclosed in double quotes, round brackets or some other form of quotation mark. In this case, the end of the address will be signalled by the first space character in the string, since a space is not valid in an email address, and everything following the space will be treated as the ‘nick’; if there is no space character at all, then there is no human-readable portion to this address, and we return the whole thing in email$.
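
The splitting logic can also be sketched in Python for experimenting with awkward From lines outside AntiSpam. This is an illustration only: split_from is an invented name, and the handling of an opening angle bracket with no closing one (the address is assumed to run to the end of the line) is a guess rather than anything the BASIC version defines.

```python
from typing import Tuple


def split_from(header: str) -> Tuple[str, str]:
    """Return (nick, email) from a raw 'From: ...' header line."""
    body = header[6:]  # drop 'From: ' without checking it, as the BASIC does
    start = body.find("<")
    if start >= 0:
        end = body.find(">", start)
        if end < 0:
            end = len(body)  # ASSUMPTION: unterminated bracket runs to the end
        email = body[start + 1:end]
        # Splice whatever precedes and follows the bracketed address
        nick = body[:start] + body[end + 1:]
    else:
        # No angle brackets: the address is everything up to the first space
        email, _, nick = body.partition(" ")
    return nick, email
```

Note that the address returned in the second case stops short of the space itself, so that it can be compared directly against other addresses.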

Detecting computer-generated addresses

PROCmanydigits is an extremely handy routine which checks how many numerical digits are contained in a given string — this one test alone accounts for about 50% of all spam I detect. Like all my test procedures, I pass it the special AntiSpam variables “exit%” and “UserTestLog$” when I call it and use the RETURN keyword to allow it to pass back any modifications — if the test decides that the header is spam, then I set the value passed to be the special AntiSpam value _DELETE%.

DEF PROCmanydigits(address$,RETURN code%,RETURN title$)
LOCAL loop%,numcount%,char$
FOR loop%=LEN(address$) TO 1 STEP -1
   char$=MID$(address$,loop%,1)
   IF char$<="9" AND char$>="0" THEN numcount%+=1
NEXT
IF numcount%>3 THEN
   code%= _DELETE%
   title$="many digits"
ENDIF
ENDPROC

This simply loops through a string and adds one to the variable numcount% every time it detects a numerical digit. It allows an address to have up to three digits in it before it decides that it is spam, allowing through addresses like <spyro@f2s.com> or <richard@dmc12.demon.co.uk> but rejecting <jrucgp767229@updates.msdn.com>, <ariana3k1l1841@yahoo.com>, or <p52elzlk2ka2@aol.com>. For some reason this catches the vast majority of computer-generated addresses; you’d think the spammers could just as easily remove this betraying feature by generating them out of random letters alone, but they don’t seem to have cottoned on to this yet.
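
As a quick way of checking the threshold against a batch of addresses, here is the same count sketched in Python. The function names and the 'allowed' parameter are invented for the example; only the rule itself (more than three digits means spam) comes from the procedure above.

```python
def count_digits(address: str) -> int:
    """Count the characters '0' to '9' in the address."""
    return sum(1 for ch in address if "0" <= ch <= "9")


def looks_generated(address: str, allowed: int = 3) -> bool:
    """True when the address holds more digits than the allowed maximum."""
    return count_digits(address) > allowed


# The sample addresses quoted in the article
for sample in ("spyro@f2s.com", "richard@dmc12.demon.co.uk",
               "jrucgp767229@updates.msdn.com", "p52elzlk2ka2@aol.com"):
    print(sample, count_digits(sample), looks_generated(sample))
```

Raising 'allowed' makes the test more lenient; whether three is the right limit for your own correspondents is something only your own mail can tell you.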

Deleting advertisements

PROCadtest is a quick test which makes use of the upper-case table again... this time, to test for punctuation! Some spammers seem to be under the impression that, provided they prefix their missives with ‘adv’ to indicate that it is an advertisement, people won’t complain. This would be very kind of them, since it makes them easier to spot, if it were not that they all use different conventions, making my initial simplistic AntiSpam rule “Delete Subject: = adv:*” fairly useless. Some use “ADV:”, some “(adv)”, some “ADV -”, and so on.

DEF PROCadtest(subject$,RETURN code%,RETURN title$)
LOCAL start%,letter%
start%=INSTR(subject$,"adv")
IF start%>0 THEN
   letter%=ASC(MID$(subject$,start%+3))
   IF (user_uppercase%?letter%)=letter% THEN code%=_DELETE%:title$="Advertisement":UserTestPriority%=user_starfighter%
   REM "adv" not followed by a-z
ENDIF
ENDPROC

It searches for the string ‘adv’ anywhere in the subject (to get around people who prefix it by one or more brackets) and then checks the letter immediately following the ‘v’ (in other words, three letters after the start of ‘adv’). Provided we take care to call the procedure as “PROCadtest(data$,exit%,UserTestLog$)” (i.e. using the normal all-lower-case version of the header, data$, rather than the ‘raw’ version header$ used to check for capitals), if the letter following the ‘v’ is reported as an ‘upper-case’ letter (uppercase version is the same as lowercase) we know that it must be a punctuation mark, and therefore this is almost certainly spam.

The idea is that it will catch all occurrences of the string “adv” (in upper- or lower-case) followed by a punctuation mark, but not any email subject which happens to mention advice, or even advertisements. I find it catches a surprisingly high proportion of spam, although it’s all from one or two sources.
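
The rule 'adv followed by anything that is not a letter' can be tried out on sample subjects with a short Python sketch. The name is_marked_advert is invented, and the sketch assumes that only a-z count as letters, where the BASIC version relies on the Territory table and so also recognises accented ones.

```python
import string


def is_marked_advert(subject: str) -> bool:
    """Apply the PROCadtest rule to an already lower-cased subject."""
    start = subject.find("adv")
    if start < 0:
        return False
    following = subject[start + 3:start + 4]
    if not following:
        # 'adv' is the last thing in the subject: like the BASIC version,
        # take no action
        return False
    # ASSUMPTION: plain a-z only; anything else counts as punctuation
    return following not in string.ascii_lowercase


for subject in ("adv: lose weight", "(adv) cheap loans",
                "adv - new offer", "some good advice"):
    print(subject, is_marked_advert(subject))
```

Only the first occurrence of 'adv' is examined, so a subject such as 'advice adv:' would slip through both this sketch and the procedure as printed.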

Catching spam words but not substrings

The most common kind of AntiSpam Rule consists of checking for the presence of a given keyword, e.g. ‘Rates’ or ‘Money’, using the syntax ‘Delete Subject: = *Rates*’. The disadvantage of this, particularly for short keywords like ‘Rates’, is that *Rates* will also match any word of which this is a substring, e.g. ‘accelerates’. Sometimes it is convenient to be able to specify that a given action — usually _DELETE% — should be taken only if the string in question is a word on its own and not a substring. This can be done using very similar principles to those in PROCadtest, if we define a ‘word’ as a set of letters delimited by a non-alphabetical character at beginning and end. The only additional complication is caused by the possibility that the ‘character’ at the start or end of the word might instead be the start or end of the string itself....

DEF PROCwholeword(line$,word$,RETURN code%,RETURN title$)
LOCAL ptr%,char%
ptr%=INSTR(line$,word$)
IF ptr%=0 THEN ENDPROC
REM word not found at all

char%=ASC(MID$(line$,ptr%-1))
REM check preceding letter
IF ptr%>1 AND user_uppercase%?char%<>char% THEN ENDPROC
REM if preceding character has upper-case version 
REM it is not punctuation
REM therefore this is a substring - exit now

char%=ASC(MID$(line$,ptr%+LEN(word$)))
REM check following letter
REM character beyond end of string is returned as ASC -1
REM if next char is beyond end of string, word itself must be at end
IF char%<>-1 AND user_uppercase%?char%<>char% THEN ENDPROC

REM Otherwise...
code%=_DELETE%
title$="Word ‘"+word$+"’ found in subject line"
ENDPROC
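
The boundary checks are the fiddly part of this procedure, so here is a rough Python sketch of the same logic that can be run against test subjects. The names are invented for illustration, and as before the sketch assumes that only a-z count as letters and that the line has already been forced to lower case.

```python
def _is_letter(ch: str) -> bool:
    """ASSUMPTION: only a-z count as letters (input is lower-case)."""
    return len(ch) == 1 and "a" <= ch <= "z"


def contains_whole_word(line: str, word: str) -> bool:
    """True if the first occurrence of word in line stands on its own."""
    pos = line.find(word)
    if pos < 0:
        return False  # word not found at all
    # An empty string stands for 'start of line' or 'end of line'
    before = line[pos - 1] if pos > 0 else ""
    after = line[pos + len(word):pos + len(word) + 1]
    return not _is_letter(before) and not _is_letter(after)


for line in ("low rates today", "the car accelerates", "rates"):
    print(line, contains_whole_word(line, "rates"))
```

Like the procedure as printed, this looks only at the first occurrence, so 'accelerates rates' is reported as a substring match and nothing more.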

Putting it all together

So far, I have only given examples of the main FNUserTest_DoTest which illustrate how to call a given single user test function. Here, just as an example, is a copy of the FNUserTest_DoTest that I myself use, showing how multiple tests can be applied to a single header:

DEF FNUserTest_DoTest(kw$, data$, header$, mbox%)
LOCAL exit%,char$
LOCAL address$,name$
LOCAL ERROR
ON ERROR LOCAL: ON ERROR OFF:ERROR ERR,REPORT$+" at line "+STR$(ERL)

The normal AntiSpam error handling silently suppresses any errors that occur during user tests in order to avoid hanging up the entire computer with a Wimp error box while downloading email. Here, I have deliberately overridden this with a local error handler that will cause AntiSpam to report any errors and quit the program. This is so that if I introduce errors into my procedures while modifying them, I get to find out about it — instead of just wondering why they don’t seem to be having any effect!

If you do this, however, you must then always try out your User Tests as thoroughly as possible using the Trial window before doing a subsequent download. Otherwise, if an error does show up in an obscure part of your programming, AntiSpam will crash and quit while actually downloading — highly undesirable to say the least.

CASE kw$ OF
  WHEN "from":UserTestPriority%=user_top_priority%
               PROCnameaddress(header$,name$,address$)
               PROCmanydigits(address$,exit%,UserTestLog$)
               REM catch computer-generated addresses
               IF exit%=0 THEN PROCuppercase(name$,exit%,UserTestLog$)

Note that I am splitting up the ‘From’ address into two halves using PROCnameaddress and applying different tests to each; note also that I’m not bothering to carry out the second check if the first one has already succeeded (i.e. if exit% has already been assigned a non-zero value).

  WHEN "to":UserTestPriority%=user_afternames%
               PROCbigfoot(data$,exit%,UserTestLog$)
  WHEN "cc":UserTestPriority%=user_afternames%
               PROCbigfoot(data$,exit%,UserTestLog$)

These two headers are being tested on a specific ‘accumulator’ (AND) rule which checks for the presence of multiple bigfoot.com addresses in the To: header, or in both the To: header and in the CC: header. It works in a similar way to the AND rule discussed in Part 2, and you will see the familiar check for user_spamprobability%>1 down below.

  WHEN "subject":UserTestPriority%=user_temporary%
               PROCadtest(data$,exit%,UserTestLog$)
               IF exit%=0 THEN PROCuppercase(MID$(header$,10),exit%,UserTestLog$):REM remove Subject:
               IF exit%=0 THEN PROCwholeword(data$,"rates",exit%,UserTestLog$):REM lower-case to match data$

ENDCASE

There are three different tests being applied to the Subject: header here, but the second two take place only if the earlier ones fail. Note also that PROCuppercase is being called (of course) using the raw header$ instead of the lower-case data$ passed to most of the other user functions; and that as a result I have to forcibly remove the ‘Subject: ’ from the beginning of the string using the MID$ operation. Finally, note that you need a separate call to PROCwholeword for each individual non-substring word you want to check for! In this case I’m only checking for one — if I were to be scrupulous and apply this to all the keywords in my Rules file, I would end up with about twenty lines at this point reading IF exit%=0 THEN PROCwholeword....

IF user_spamprobability%>1 THEN UserTestPriority%=user_temporary%:exit%=_DELETE%:user_spamprobability%=-10
REM stop it returning on every single header line once a delete
REM message has been accumulated
=exit%

This last check should be readily recognisable from the earlier discussion of how to accumulate spam probabilities over the course of checking several different headers. The only difference here is that I’ve added a cosmetic extra statement user_spamprobability%=-10.

Because this section of code is carried out every time the program reaches this point, once the variable user_spamprobability% is greater than 1 it will return the _DELETE% code for every single remaining header in the email. This doesn’t affect the outcome of the test, since we’ve already decided that we want this email deleted, and any Rules with higher priority will still override it, but it does involve pressing the Continue button another ten or more times in the Trial window when testing it out! By resetting the variable to -10 once its value has been duly noted, I try to make sure that this part of the program is only triggered once.
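
The effect of parking the variable at -10 is perhaps easiest to see in a toy model. The Python below is a sketch under stated assumptions, not AntiSpam's own logic: the Accumulator class, the DELETE value of 1 and the per-header scores are all invented, and only the 'greater than 1' check and the reset to -10 are taken from the code above.

```python
DELETE = 1  # hypothetical stand-in for AntiSpam's _DELETE% code


class Accumulator:
    """Toy model of user_spamprobability% over the headers of one email."""

    def __init__(self) -> None:
        self.probability = 0

    def finish_header(self, score: int) -> int:
        """Add this header's score; return DELETE at most once."""
        self.probability += score
        if self.probability > 1:
            self.probability = -10  # park well below the threshold
            return DELETE
        return 0


acc = Accumulator()
# Hypothetical scores for five successive header lines
print([acc.finish_header(score) for score in (1, 1, 0, 0, 1)])
```

Without the reset, every header after the second would report a deletion again; with it, a further eleven suspicious headers would be needed before the check could fire a second time.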

Conclusion — and a word of warning

I hope these procedures are either helpful in themselves, or helpful as a guide to what can be done. BASIC’s string handling facilities are not ideal (although a better programmer than I am could call the Regular Expressions module from BASIC), but simple loops and the use of MID$ can do a great deal. All the user tests I’ve illustrated catch a fairly high proportion of our spam, the lowest being 20 out of 650; still higher than almost any of the individual keyword tests. There isn’t much point in programming a complex test which will only catch an isolated spam every now and then — on the other hand, if you can identify a distinctive spam trait (email CC’d to multiple Bigfoot addresses, for example), it’s probably possible for someone to write a BASIC procedure to check for it.

Have fun defeating the spammers’ best efforts — there is a great deal of satisfaction to be had in watching the output of ‘your’ tests appear in the log file — but don’t forget: always use the Trial window to test out your procedures thoroughly before you let them loose. Test them on the type of spam they are designed to catch and on valid messages as well, and remember that you have to quit and reload AntiSpam every time you change the UserTests file before your changes will actually show up...!

AntiSpam can and will delete your email. That’s what it was written to do. Be careful.

