Writing UserTests for AntiSpam (2)

Harriet Bazley

Initialisation code

The AntiSpam UserTests file contains three other Basic procedures in addition to FNUserTest_DoTest, and each serves a similar purpose — they allow you to [re-]initialise areas of memory and variables at a specific stage in the program.

PROCUserTest_Initialise is the simplest; this is called once only, during the initialisation of AntiSpam itself. This is where the major initialisation code for any complex structures will go. To illustrate this, we can use the example of a whitelist.

Whitelisting the contents of a Messenger address-book

One technique used quite frequently in spam filtering is that of a whitelist — a list of people who are known not to be spam sources, and need not be filtered. For example, if I had an AntiSpam Rule which deletes all email with a subject line containing the keyword “debt”, if you were on my whitelist you could still send me an email with the subject “Greatly in your debt for writing Textseek”, or some such flattering remark, without getting filtered out!

A common source of addresses for a whitelist is the address-book file of the user’s email client; this means that it is automatically updated every time the master list changes, without having to be maintained in any way. For this example, I’ve chosen R-Comp’s Messenger Pro.

Investigation shows that the ‘public’ address-book can be found at NewsDir:Messenger.AddrBook, although in real life you’d probably want to locate and scan the private address-books of each Messenger ‘user’ as well. For demonstration purposes — this is an article on UserTest programming, not on data management — we are simply going to load the file into an array and loop through the entire array every time we want to check if an address is present, though in practice you might want to use a slightly more sophisticated search technique.

Since Basic arrays can be dimensioned only once, this needs to be done within PROCUserTest_Initialise, and we first need to establish how many entries will be needed. As the address-book file contains one address per line, all we need to do is to open the file from Basic and load it in one line at a time, counting as we go:

DEF PROCUserTest_Initialise
LOCAL i%,addr$,load%,tabpos%,nexttabpos%
i%=OPENIN"NewsDir:Messenger.AddrBook"
IF i%=0 THEN ERROR 255,"Address book not found"
REPEAT
  addr$=GET$#i%:user_count%+=1
UNTIL (EOF#i%)
REM find required array size

Local and global variables

Note the use above of ‘user_’ to prefix the count variable. This is because the number of entries in the array will need to be a global variable (one not declared as LOCAL to the procedure) so that we can access it from FNUserTest_DoTest later on.

It is strongly recommended that, in AntiSpam UserTests, all new global variables that you create should be prefixed by ‘user_’ in this way. Otherwise, as you can imagine, if a user goes round DIMming blocks of memory as block% or passing strings around as boundary$ or idle$ he runs a severe risk of corrupting values that may be in use by AntiSpam itself. Only those variables prefixed by user_ can be guaranteed to be unique.

A LOCAL variable declaration is used at the top of the procedure to ensure that all the other variables we are going to need are forgotten about as soon as we reach the ENDPROC.
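
As a minimal stand-alone sketch of the difference (to be run on its own, outside AntiSpam; the names PROCuser_demo, scratch% and user_total% are invented purely for illustration), a LOCAL variable starts again from zero on every call, whereas a global keeps its value:

```bbcbasic
REM Stand-alone demonstration, not part of the UserTests file
user_total%=0
PROCuser_demo
PROCuser_demo
PRINT user_total%
END

DEF PROCuser_demo
LOCAL scratch%
REM scratch% is zero again on every call
scratch%+=5
REM user_total% is global, so it keeps accumulating
user_total%+=scratch%
ENDPROC
```

After the two calls user_total% holds 10, while whatever was in scratch% is discarded at each ENDPROC.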

Filling the array

DIM user_addresslist$(user_count%)
PTR#i%=0
REM return to top of open file
load%=0
REPEAT
  addr$=FNlcase(GET$#i%)
  tabpos%=INSTR(addr$,CHR$(9))+2
  nexttabpos%=INSTR(addr$,CHR$(9),tabpos%)
  addr$=MID$(addr$,tabpos%,nexttabpos%-tabpos%)
  REM ignore mailing list entries, which start with "#"
  IF LEFT$(addr$,1)<>"#" THEN load%+=1:user_addresslist$(load%)=addr$
UNTIL EOF#i%
user_count%=load%
REM make address list; entries occupy elements 1 to user_count%
CLOSE#i%
ENDPROC

Having established how large it needs to be, we can now DIMension the array (note: another global variable) and set the file pointer back to the top of the file (PTR#i%=0, since file offsets count from zero) so that future file accesses will start from the beginning again. The second pass reads until the end of the file, storing each usable address in the next free element (starting from element 1), and finally resets user_count% to the number of addresses actually kept.

The part of the address-book entry we are interested in, the email address, is located on each line between two tab characters. The section of code within the loop simply locates the offset of the first tab character and the number of characters between them, and takes a slice out of the middle of the string of the requisite length, slightly complicated by the need to ignore special-format list entries which do not contain a usable address. I have cheated slightly by looking inside the main AntiSpam program and using the routine already defined there, FNlcase, to force my addresses to lowercase — there is no point in reinventing the wheel....

We have thus used PROCUserTest_Initialise to set up an entire array of people whose addresses we wish to ‘whitelist’, which can then be accessed from FNUserTest_DoTest. This can be done using simplistic code:

 WHEN "from":
   FOR addressloop%=1 TO user_count%
    addressfound%=INSTR(data$,user_addresslist$(addressloop%))
    IF addressfound%>0 THEN
      exit%=ACCEPT%
      UserTestLog$="Address in address book"
      UserTestPriority%=1
      addressloop%=user_count%+100
    ENDIF
   NEXT addressloop%
 REM loop round whole addressbook checking for this address

This is not exactly sophisticated; the only point worth noting is that if an address is found, addressloop% is set to 100 more than the end value of the loop, so that the following NEXT statement terminates the loop instead of going round again. This compensates slightly for the inefficiency of the search by causing an early exit in the cases where it does succeed.
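
One possible tidy-up, offered only as a sketch, is to move the search into a function of its own so that the loop can stop cleanly as soon as a match is found. FNuser_in_whitelist is a hypothetical name, not part of AntiSpam; it assumes the addresses are held in elements 1 to user_count% of user_addresslist$(), as in the loop above.

```bbcbasic
REM Hypothetical helper: TRUE if data$ contains any whitelisted address
DEF FNuser_in_whitelist(data$)
LOCAL n%,found%
n%=1
WHILE n%<=user_count% AND NOT found%
  REM skip blank entries, which would not be a meaningful match
  IF user_addresslist$(n%)<>"" THEN
    IF INSTR(data$,user_addresslist$(n%))>0 THEN found%=TRUE
  ENDIF
  n%+=1
ENDWHILE
=found%
```

The "from" clause of the CASE block then only needs to test FNuser_in_whitelist(data$) before setting exit%, UserTestLog$ and UserTestPriority% as before.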

Setting relative test priorities

PROCUserTest_NewMailbox, as the name indicates, is called every time a new ‘mailbox’ is scanned, that is, once for every separate address from which AntiSpam is configured to download email. Theoretically speaking, this allows you to have user tests which only apply to one mailbox, just as it is possible to specify a separate Rules file for every mailbox in the Config file.

However, it is perhaps more useful to consider it as a procedure which is called once per download session and immediately before the download begins. Here you can [re-]initialise variables that will be used to store values that may be applied across several messages; for example, you could decide to let the first ten messages through ‘free’, and only filter on the rest!
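
A sketch of that last idea follows. It assumes a hypothetical global counter user_messagecount%, and it assumes that returning 0 from FNUserTest_DoTest means ‘no verdict’ (as the zero-initialised exit% used later in this article suggests). PROCUserTest_NewMessage is described further on; the lines shown would be merged into the existing procedures in your UserTests file rather than added as new definitions.

```bbcbasic
DEF PROCUserTest_NewMailbox(mbox%)
REM start counting afresh for this mailbox
user_messagecount%=0
ENDPROC

DEF PROCUserTest_NewMessage(mbox%)
REM one more message seen in this session
user_messagecount%+=1
ENDPROC

DEF FNUserTest_DoTest(kw$, data$, header$, mbox%)
LOCAL exit%
REM let the first ten messages through untested
IF user_messagecount%<=10 THEN =0
REM ... the usual CASE block of tests goes here ...
=exit%
```

Because the counter is reset in PROCUserTest_NewMailbox, the allowance of ten applies to each mailbox separately.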

Perhaps the most useful thing to bear in mind, though, is that the Rules file is read in afresh immediately before each call to PROCUserTest_NewMailbox. Obviously, if you are specifying different Rules files for different mailboxes, this is vital, but it also ensures that any changes made to the Rules file since the last download will be picked up without the need to restart AntiSpam. This is also why, if you are using the User-Priority keyword to establish relative priority positions between user rules and specific entries in a Rules file, any calls to FNget_user_priority need to go in here. Altering your Rules file is highly likely to have changed the location of a given Rule....

The User-Priority keyword

In all the examples so far, we have been using ‘absolute’ values when setting UserTestPriority%. This is no problem if we know, for example, that we wish our whitelist of addresses to overrule all other Rules — we simply assign it a priority of 1, making it equivalent to the top line in the Rules file. A more common scenario, however, is that we have perhaps thirty or forty Rules, of which the top seven are Accept rules, and that we wish our user tests to outrank all the rest except for the Accept rules.

It is easy enough to set UserTestPriority%=8 when programming a user test; the problems come if you subsequently edit the Rules file to add another Accept rule, which then comes out as lower-ranked than all the user tests! In the early days of AntiSpam, I actually lost email this way.

Fortunately, there is now a mechanism which allows you to ‘reserve’ space in the Rules file. You can insert a User-Priority line into the Rules file to represent the position of a user test, and it will act as an extra, invisible, Rule, moving up and down in the file relative to the lines around it as Rules are added, deleted or changed. You can have as many or as few such lines as you like — there is nothing to prevent all your user tests sharing the same priority, but you can also insert special cases to deal with specific Rules.

For example, I have an Accept Rule that accepts email with the string “3000” in the subject line, to rescue email referring to StarFighter 3000, SF3000, etc. from being deleted by subsequent user tests; but as we also receive a lot of advertising spam from PC vendors offering us Sony AV3000 LCD screens and so on, I want to make sure that the user test that checks for the string “ADV” followed by a punctuation symbol has a higher priority than this Accept Rule, but a lower priority than all the others. As a result, I have a special entry “User-Priority StarFighter 3000” immediately before “Accept Subject: = *3000*”, ensuring that however many Rules get added above or below, I can always guarantee that the priority for the adv-test will be greater than that for this Rule.
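
Using only the two lines quoted above, the relevant part of the Rules file would look something like this (any other Rules are omitted here):

```
User-Priority StarFighter 3000
Accept Subject: = *3000*
```

Since the User-Priority line sits directly above the Accept Rule, a user test that takes its priority from it should always rank just ahead of that Rule, wherever the pair ends up in the file.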

FNget_user_priority

We can access the position of individual User-Priority lines from Basic by calling the special AntiSpam function FNget_user_priority(text$). The value of text$ should be the remaining ‘label’ text which follows the actual User-Priority keyword, e.g. “StarFighter 3000”. The value returned by the function will be the relevant Rule number.

Note that you can’t call FNget_user_priority from PROCUserTest_Initialise, since the Rules file doesn’t get read in until just before AntiSpam actually goes on-line! You can do so from within the actual user test itself, for example UserTestPriority%=FNget_user_priority("StarFighter 3000"), but since I have a lot of user tests, many of which share the same priority, I normally set up my priorities as global variables from within PROCUserTest_NewMailbox at the start of each run:

DEF PROCUserTest_NewMailbox(mbox%)
user_top_priority%=FNget_user_priority("Accepts end here")
REM this adjusts if we add more Accept rules to Rules file
user_starfighter%=FNget_user_priority("StarFighter 3000")
user_temporary%=FNget_user_priority("temporary section")
user_afternames%=FNget_user_priority("after names")
ENDPROC
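
A test can then pick up one of these globals instead of a hard-coded number. The following is only a sketch of the “ADV” check mentioned earlier: FNuser_has_adv is a hypothetical helper, and the set of punctuation characters is a guess rather than the list actually used.

```bbcbasic
REM In the CASE block of FNUserTest_DoTest:
WHEN "subject":IF FNuser_has_adv(data$) THEN exit%=_DELETE%:UserTestLog$="ADV in subject":UserTestPriority%=user_starfighter%

REM Elsewhere in the UserTests file (hypothetical helper):
REM TRUE if "adv" is followed by a punctuation character.
REM data$ arrives in lower case; only the first "adv" is examined.
DEF FNuser_has_adv(subject$)
LOCAL advpos%,after$
advpos%=INSTR(subject$,"adv")
IF advpos%=0 THEN =FALSE
after$=MID$(subject$,advpos%+3,1)
IF after$="" THEN =FALSE
=INSTR(":;.,-!]",after$)>0
```

Because the priority comes from user_starfighter%, the test keeps its place just ahead of the “3000” Accept Rule however the Rules file is edited.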

Spam tests that operate across several headers

Sometimes, simply using LOCAL variables from within FNUserTest_DoTest is not enough. These get reset every time a new header line is checked — and while this is generally exactly what is wanted, for some kinds of comparison one needs to compare values between different headers.

A classic example is AND and OR comparisons, which can only be made via user tests. There is no way, in an AntiSpam Rules file, to say “if the Subject: header contains the string ‘free’ and the To: header does not contain my email address, then delete”. However, in terms of Basic logic, this is easy. We simply use WHEN "subject" and WHEN "to" statements, and insert a line making the relevant checks into each, incrementing an integer variable to tell us when/if all the conditions have been met.

It is, of course, extremely important to ensure that this ‘spam probability’ count is reset at the start of each new message; otherwise, once the necessary probability has accumulated, the program will end up thinking that all subsequent messages are spam! (Yes, I have actually done this....) This is where PROCUserTest_NewMessage comes in. It is called every time a new message is downloaded.

An AND rule

The problem is that you have to ‘accumulate’ a total spam probability over the course of several headers, which translates into several separate calls of FNUserTest_DoTest(). In order to implement a simple AND rule along these lines, we thus need some special variables to store data in between calls, but need to ensure that they are re-zeroed between messages.

DEF PROCUserTest_NewMessage(mbox%)
  user_spamprobability%=0
  REM stuff for accumulating likely spam traits
  REM one alone is not enough to cause rejection,
  REM but two or more are fatal
  UserTestLog$ = ""
ENDPROC

Note that we also need to locate the existing line which reads UserTestLog$ = "" within FNUserTest_DoTest, and delete it! In its current position, it will blank this variable every time a new header line is encountered — we need to keep it while comparing several different header lines, and blank it whenever a new message is encountered. PROCUserTest_NewMessage will do that job for us, and since tests will always be altering any existing value of UserTestLog$ rather than reading it, moving the initialisation to an earlier stage will have no ill-effect.

The variable user_spamprobability% starts off with a value of zero (as set up in PROCUserTest_NewMessage), so if we add one to it every time a spam condition is met (i.e. “Field ‘To:’ does not contain my email address” AND “Field ‘Subject:’ contains ‘free’”), when it reaches a value of 2 or more this signifies a spam email.

It is sufficient, therefore (assuming the existence of user_spamprobability%-manipulating code which we haven’t written yet!) to add a line

  IF user_spamprobability%>=2 THEN LET UserTestLog$="Free email not addressed to me":UserTestPriority%=20:exit%=_DELETE%

after the ENDCASE statement in FNUserTest_DoTest, and to increment the aforesaid user_spamprobability% in two different places inside the CASE block:

DEF FNUserTest_DoTest(kw$, data$, header$, mbox%)
LOCAL exit%
REM LOCAL variables always start with the value 0
REM INSTR returns 0 if the string does not occur anywhere in data$
CASE kw$ OF
   WHEN "to":IF INSTR(data$,"harriet@bazley.freeuk.com")=0 THEN user_spamprobability%+=1
   WHEN "subject":IF INSTR(data$,"free")>0 THEN user_spamprobability%+=1
ENDCASE
IF user_spamprobability%>=2 THEN LET UserTestLog$="Free email not addressed to me":UserTestPriority%=20:exit%=_DELETE%
=exit%

Remember that data$ has been transformed to lower-case for comparison purposes!
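
One weakness of a simple count is that a header which turns up twice would be counted twice. A possible variant, sketched here with a hypothetical global user_spamflags% in place of user_spamprobability%, gives each condition its own bit, so that repeats make no difference:

```bbcbasic
DEF PROCUserTest_NewMessage(mbox%)
  REM clear both condition bits for each new message
  user_spamflags%=0
  UserTestLog$ = ""
ENDPROC

DEF FNUserTest_DoTest(kw$, data$, header$, mbox%)
LOCAL exit%
REM bit 0 (value 1): To: does not contain my address
REM bit 1 (value 2): Subject: contains "free"
CASE kw$ OF
   WHEN "to":IF INSTR(data$,"harriet@bazley.freeuk.com")=0 THEN user_spamflags%=user_spamflags% OR 1
   WHEN "subject":IF INSTR(data$,"free")>0 THEN user_spamflags%=user_spamflags% OR 2
ENDCASE
REM both bits set (1 OR 2 = 3) means both conditions were met
IF user_spamflags%=3 THEN UserTestLog$="Free email not addressed to me":UserTestPriority%=20:exit%=_DELETE%
=exit%
```

Testing for user_spamflags%<>0 instead of user_spamflags%=3 would turn the same code into an OR rule.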

In my final article, to illustrate how a number of different user tests can co-exist within the same UserTests file, I shall explain some of the actual user tests that I use myself. Between them they account for a high proportion of our incoming spam.

