I’ve been running a NeuralMesh artificial neural network (ANN) for about a week now, with the goal of detecting spam. I am feeding it 7 different spam scores from the b8 Bayesian library.
My 7 ‘aspects’ of email spam are:
- to
- from
- full from
- subject
- full subject
- header
- body
I am using the To field because sometimes spam is sent to funny names, and sometimes real email is sent to a particular person. The From score covers both the email address and the name the message comes from; luckily, some spammers put their product names in the From field. The Full From score processes the From field as a whole, which should be good for duplicate messages: when I’ve seen it once before, I have a good idea of what it is. The Subject and Full Subject scores operate on the same principle, but sometimes a random word is put into a subject, so the full subject isn’t necessarily a solid repeat indicator. The Header score is designed to look for SPF records and IP addresses in addition to including the above bits. The Body score, well, that’s just the whole message.
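As a rough illustration of how the seven scores could be lined up for the ANN, here is a minimal sketch. The `ASPECTS` constant, the `build_input_vector()` helper, the neutral 0.5 fallback and the clamping are all assumptions made for the example; they are not part of b8 or NeuralMesh.

```php
<?php
// Hypothetical aspect names, taken from the list above. Their order fixes
// which ANN input each score feeds.
const ASPECTS = ['to', 'from', 'full from', 'subject', 'full subject', 'header', 'body'];

/**
 * Turn a map of aspect => score into an ordered 7-element input vector.
 * Assumes each score is a probability-like value in [0, 1]. A missing
 * aspect falls back to a neutral 0.5 (an assumption for this sketch).
 */
function build_input_vector(array $scores): array
{
    $inputs = [];
    foreach (ASPECTS as $aspect) {
        $value = isset($scores[$aspect]) ? (float) $scores[$aspect] : 0.5;
        // Clamp so a stray out-of-range score stays inside [0, 1].
        $inputs[] = max(0.0, min(1.0, $value));
    }
    return $inputs;
}
```

Keeping the order in one constant means the same score always lands on the same input, which matters because the network learns a separate weight for each position.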
I feed all these different scores into a 7-input, 5-neuron, 1-output ANN to let it sort out what’s meaningful and what’s not. Every time somebody deletes a message as spam or accepts an email as not spam, I use that as a training trigger for both the b8 Bayesian library and the NeuralMesh ANN library. Because my spam to not-spam ratio is so out of whack, the training loop doesn’t need many iterations for spam, but needs a lot for not spam.
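To make the 7-5-1 shape concrete, here is a minimal forward-pass sketch in plain PHP. It does not use the NeuralMesh API; the function names, the sigmoid activation and the all-zero weights are assumptions chosen only to show how seven scores collapse into one output.

```php
<?php
function sigmoid(float $x): float
{
    return 1.0 / (1.0 + exp(-$x));
}

/**
 * One fully connected layer: each row of $weights holds one neuron's
 * input weights, and $biases holds one bias per neuron.
 */
function layer_forward(array $inputs, array $weights, array $biases): array
{
    $outputs = [];
    foreach ($weights as $n => $row) {
        $sum = $biases[$n];
        foreach ($row as $i => $w) {
            $sum += $w * $inputs[$i];
        }
        $outputs[] = sigmoid($sum);
    }
    return $outputs;
}

/** Forward pass through a 7-5-1 network; returns a single score in (0, 1). */
function ann_score(array $inputs, array $hidden, array $output): float
{
    $h = layer_forward($inputs, $hidden['w'], $hidden['b']);
    $o = layer_forward($h, $output['w'], $output['b']);
    return $o[0];
}

// Illustrative weights only: all zeros, so the network has learned nothing yet.
$hidden = ['w' => array_fill(0, 5, array_fill(0, 7, 0.0)), 'b' => array_fill(0, 5, 0.0)];
$output = ['w' => [array_fill(0, 5, 0.0)], 'b' => [0.0]];
$score = ann_score([0.9, 0.8, 0.7, 0.9, 0.6, 0.5, 0.95], $hidden, $output);
```

Training is what moves those weights away from zero, so that aspects which reliably separate spam from real mail end up with more influence on the final score.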
To address this, I am using an adaptive training iteration system. The closer the system’s score is to a pure spam score, the fewer training iterations I run; the further it is from a pure spam score, the more iterations I run.
$learning_iterations = round((1 - $finalscore['AI Score']) * 100);
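The same formula can be wrapped in a small function to make the mapping easier to test. This is only a sketch: the `adaptive_iterations()` name, the clamp to [0, 1] and the cast to an integer are additions for the example, and the two sample scores are made up.

```php
<?php
/**
 * Map an ANN score in [0, 1] to a training iteration count.
 * Mirrors round((1 - score) * 100); the clamp and the int cast are
 * additions for this sketch.
 */
function adaptive_iterations(float $aiScore): int
{
    $aiScore = max(0.0, min(1.0, $aiScore));
    return (int) round((1.0 - $aiScore) * 100);
}

echo adaptive_iterations(0.97), "\n"; // score close to pure spam: few iterations
echo adaptive_iterations(0.34), "\n"; // score far from pure spam: many iterations
```

A score of 1 gives 0 iterations and a score of 0 gives the maximum of 100, so the multiplier sets the upper bound on how long any single training session can run.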
This seems to be working well, judging by the table below of the most recent training sessions the ANN has undergone. The fewer the iterations, the higher the spam score, and high spam scores are what the system has seen a lot of. Comparing the Start MSE to the End MSE, any training under 4 iterations doesn’t really change anything. The events with higher iteration counts are not spam, so the system starts with a very high Start MSE, and the extra training iterations bring the End MSE right down.
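For reading the MSE columns, it may help to recall how a mean squared error is usually computed. The sketch below uses the textbook definition; whether NeuralMesh calculates its Start MSE and End MSE in exactly this way is an assumption, and the `mean_squared_error()` helper is hypothetical.

```php
<?php
/** Mean squared error between target values and network outputs. */
function mean_squared_error(array $targets, array $outputs): float
{
    $sum = 0.0;
    foreach ($targets as $i => $target) {
        $diff = $target - $outputs[$i];
        $sum += $diff * $diff;
    }
    return $sum / count($targets);
}

// With a single output, the MSE is just the squared gap between the
// label and the score the network produced.
$mse = mean_squared_error([0.0], [0.5]);
```

Under this definition, with one output bounded between 0 and 1, an MSE near 1 means the network answered almost the exact opposite of the label, while an MSE near 0 means it already agreed with it.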
You can see that after every non-spam training session (the table is listed newest first), it takes a couple of data sets before it settles back down to its usual boring 3s and 4s for training. The data corpus is still pretty small, with only a few non-spam data points trained into the system so far.
I think I am going to skip all training under 3 iterations, as it doesn’t seem to make any real difference. It probably just loads the system up with extra data points, and will eventually slow it down as I accumulate more data. I lose a lot of the little finesse trainings, but I think having fewer, more meaningful trainings as the corpus drifts will be more resource friendly.
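One way the skip rule could look in code is sketched below. The `MIN_TRAINING_ITERATIONS` constant and the `plan_training()` helper are hypothetical names, and returning 0 to mean "skip this session" is a convention chosen for the example.

```php
<?php
// Hypothetical cut-off: sessions below this many iterations are skipped.
const MIN_TRAINING_ITERATIONS = 3;

/**
 * Decide whether a training session is worth running.
 * Returns the iteration count to use, or 0 to skip the session entirely.
 */
function plan_training(float $aiScore, int $minIterations = MIN_TRAINING_ITERATIONS): int
{
    $iterations = (int) round((1.0 - $aiScore) * 100);
    return $iterations < $minIterations ? 0 : $iterations;
}

$iterations = plan_training(0.98);
if ($iterations > 0) {
    // Hand the data point and iteration count to the training call here.
}
```

Keeping the threshold as a parameter makes it easy to raise it later if sessions of 3 or 4 iterations keep showing no change between Start MSE and End MSE.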
Iterations | Start MSE | End MSE | Date | Exec Time | Off-line |
---|---|---|---|---|---|
66 | 0.978118000 | 0.000000000 | 16/12/2010 9:33:15 am | 0.01188200 | n |
3 | 0.000087000 | 0.000087000 | 16/12/2010 8:45:08 am | 0.00063600 | n |
3 | 0.000108000 | 0.000108000 | 16/12/2010 8:44:43 am | 0.00059300 | n |
3 | 0.000098000 | 0.000098000 | 16/12/2010 8:44:41 am | 0.00056200 | n |
3 | 0.000083000 | 0.000083000 | 16/12/2010 8:44:40 am | 0.00061200 | n |
3 | 0.000107000 | 0.000107000 | 16/12/2010 8:44:39 am | 0.00062100 | n |
4 | 0.000100000 | 0.000100000 | 16/12/2010 8:44:37 am | 0.00078800 | n |
3 | 0.000078000 | 0.000078000 | 16/12/2010 8:44:36 am | 0.00061500 | n |
3 | 0.000077000 | 0.000077000 | 16/12/2010 8:44:35 am | 0.00069400 | n |
4 | 0.000113000 | 0.000113000 | 16/12/2010 8:44:33 am | 0.00076700 | n |
3 | 0.000114000 | 0.000114000 | 16/12/2010 8:44:31 am | 0.00060800 | n |
3 | 0.000069000 | 0.000068000 | 16/12/2010 8:44:30 am | 0.00061600 | n |
4 | 0.000105000 | 0.000104000 | 16/12/2010 8:44:09 am | 0.00075900 | n |
4 | 0.000101000 | 0.000101000 | 16/12/2010 8:44:08 am | 0.00075800 | n |
4 | 0.000118000 | 0.000118000 | 16/12/2010 8:44:06 am | 0.00075200 | n |
3 | 0.000119000 | 0.000119000 | 16/12/2010 8:44:04 am | 0.00062000 | n |
4 | 0.000066000 | 0.000065000 | 16/12/2010 8:44:04 am | 0.00086800 | n |
3 | 0.000119000 | 0.000119000 | 16/12/2010 8:44:01 am | 0.00061100 | n |
3 | 0.000120000 | 0.000120000 | 16/12/2010 8:44:01 am | 0.00063000 | n |
4 | 0.000123000 | 0.000123000 | 16/12/2010 8:43:58 am | 0.00086400 | n |
4 | 0.000112000 | 0.000112000 | 16/12/2010 8:43:30 am | 0.00080700 | n |
3 | 0.000113000 | 0.000113000 | 16/12/2010 8:43:29 am | 0.00058800 | n |
4 | 0.000114000 | 0.000114000 | 16/12/2010 8:43:28 am | 0.00076300 | n |
3 | 0.000128000 | 0.000128000 | 16/12/2010 8:43:27 am | 0.00060900 | n |
4 | 0.000129000 | 0.000129000 | 16/12/2010 8:43:26 am | 0.00088600 | n |
3 | 0.000105000 | 0.000105000 | 16/12/2010 8:43:24 am | 0.00058300 | n |
4 | 0.000131000 | 0.000130000 | 16/12/2010 8:43:23 am | 0.00076200 | n |
4 | 0.000082000 | 0.000082000 | 16/12/2010 8:43:21 am | 0.00075600 | n |
3 | 0.000134000 | 0.000134000 | 16/12/2010 8:43:20 am | 0.00061800 | n |
4 | 0.000122000 | 0.000122000 | 16/12/2010 8:43:04 am | 0.00075100 | n |
4 | 0.000137000 | 0.000136000 | 16/12/2010 8:43:02 am | 0.00079800 | n |
4 | 0.000139000 | 0.000138000 | 16/12/2010 8:43:02 am | 0.00086800 | n |
4 | 0.000140000 | 0.000139000 | 16/12/2010 8:43:00 am | 0.00073600 | n |
4 | 0.000142000 | 0.000141000 | 16/12/2010 8:42:59 am | 0.00075300 | n |
4 | 0.000128000 | 0.000128000 | 16/12/2010 8:42:59 am | 0.00076200 | n |
4 | 0.000099000 | 0.000098000 | 16/12/2010 8:42:57 am | 0.00075400 | n |
4 | 0.000114000 | 0.000113000 | 16/12/2010 8:42:17 am | 0.00084600 | n |
4 | 0.000147000 | 0.000146000 | 16/12/2010 8:39:23 am | 0.00076900 | n |
4 | 0.000149000 | 0.000148000 | 16/12/2010 8:39:20 am | 0.00079300 | n |
4 | 0.000136000 | 0.000135000 | 16/12/2010 8:39:18 am | 0.00081300 | n |
4 | 0.000138000 | 0.000137000 | 16/12/2010 8:39:15 am | 0.00075000 | n |
4 | 0.000158000 | 0.000157000 | 16/12/2010 8:39:13 am | 0.00079700 | n |
4 | 0.000155000 | 0.000154000 | 16/12/2010 8:39:13 am | 0.00074500 | n |
8 | 0.000291000 | 0.000160000 | 16/12/2010 8:39:12 am | 0.00150400 | n |
19 | 0.000667000 | 0.000524000 | 16/12/2010 8:37:45 am | 0.00386300 | n |
21 | 0.002455000 | 0.000771000 | 16/12/2010 8:37:43 am | 0.00371700 | n |
84 | 0.075070000 | 0.000000000 | 16/12/2010 8:35:36 am | 0.01497400 | n |
5 | 0.000146000 | 0.000144000 | 16/12/2010 8:34:20 am | 0.00097200 | n |
5 | 0.000148000 | 0.000147000 | 16/12/2010 8:34:18 am | 0.00094400 | n |
5 | 0.000151000 | 0.000149000 | 16/12/2010 8:34:17 am | 0.00095600 | n |
5 | 0.000105000 | 0.000104000 | 16/12/2010 8:34:14 am | 0.00096000 | n |
9 | 0.000123000 | 0.000121000 | 16/12/2010 8:34:03 am | 0.00164900 | n |
9 | 0.000126000 | 0.000124000 | 16/12/2010 8:30:34 am | 0.00171600 | n |
6 | 0.000129000 | 0.000127000 | 16/12/2010 8:30:13 am | 0.00114000 | n |
5 | 0.000159000 | 0.000157000 | 16/12/2010 8:30:12 am | 0.00092800 | n |
6 | 0.000169000 | 0.000167000 | 16/12/2010 8:30:11 am | 0.00115800 | n |
6 | 0.000172000 | 0.000169000 | 16/12/2010 8:29:32 am | 0.00231000 | n |
6 | 0.000150000 | 0.000148000 | 16/12/2010 8:29:29 am | 0.00116100 | n |
6 | 0.000192000 | 0.000189000 | 16/12/2010 8:29:27 am | 0.00110500 | n |
6 | 0.000105000 | 0.000104000 | 16/12/2010 8:29:17 am | 0.00114200 | n |
8 | 0.000183000 | 0.000178000 | 16/12/2010 8:29:02 am | 0.00164100 | n |
6 | 0.000195000 | 0.000191000 | 16/12/2010 8:28:14 am | 0.00116900 | n |
5 | 0.000218000 | 0.000215000 | 16/12/2010 8:26:38 am | 0.00098300 | n |
6 | 0.000200000 | 0.000197000 | 16/12/2010 8:26:13 am | 0.00114300 | n |
10 | 0.000167000 | 0.000161000 | 16/12/2010 8:25:59 am | 0.00187000 | n |
6 | 0.000229000 | 0.000216000 | 16/12/2010 8:22:49 am | 0.00110500 | n |
2 | 0.004528000 | 0.004528000 | 16/12/2010 8:22:44 am | 0.00034600 | n |
7 | 0.000176000 | 0.000172000 | 16/12/2010 8:22:41 am | 0.00128300 | n |
8 | 0.000269000 | 0.000259000 | 16/12/2010 8:19:45 am | 0.00146800 | n |
11 | 0.000334000 | 0.000311000 | 15/12/2010 5:19:48 pm | 0.00199000 | n |
10 | 0.000340000 | 0.000319000 | 15/12/2010 4:10:16 pm | 0.00186500 | n |
12 | 0.000869000 | 0.000527000 | 15/12/2010 2:47:10 pm | 0.00221200 | n |
8 | 0.000471000 | 0.000441000 | 15/12/2010 2:46:57 pm | 0.00165100 | n |
12 | 0.000552000 | 0.000488000 | 15/12/2010 1:14:04 pm | 0.00221000 | n |
68 | 0.008667000 | 0.000560000 | 15/12/2010 1:12:59 pm | 0.01240600 | n |
30 | 0.976014000 | 0.000002000 | 15/12/2010 1:12:19 pm | 0.00550100 | n |
4 | 0.000014000 | 0.000014000 | 15/12/2010 1:11:02 pm | 0.00086600 | n |
4 | 0.000371000 | 0.000366000 | 15/12/2010 11:13:43 am | 0.00089800 | n |
4 | 0.000013000 | 0.000013000 | 15/12/2010 9:32:22 am | 0.00078800 | n |
4 | 0.000022000 | 0.000022000 | 15/12/2010 9:32:10 am | 0.00073700 | n |
4 | 0.000247000 | 0.000243000 | 15/12/2010 9:32:09 am | 0.00079800 | n |
3 | 0.000052000 | 0.000052000 | 15/12/2010 9:32:08 am | 0.00057900 | n |
4 | 0.000034000 | 0.000034000 | 15/12/2010 9:32:07 am | 0.00079400 | n |
3 | 0.000154000 | 0.000153000 | 15/12/2010 9:32:06 am | 0.00068800 | n |
4 | 0.000019000 | 0.000019000 | 15/12/2010 9:32:04 am | 0.00081700 | n |
4 | 0.000017000 | 0.000017000 | 15/12/2010 9:32:03 am | 0.00090600 | n |
4 | 0.000014000 | 0.000014000 | 15/12/2010 9:32:01 am | 0.00079600 | n |
4 | 0.000045000 | 0.000045000 | 15/12/2010 9:31:59 am | 0.00079000 | n |
4 | 0.000032000 | 0.000032000 | 15/12/2010 9:31:58 am | 0.00074200 | n |
4 | 0.000019000 | 0.000019000 | 15/12/2010 9:31:57 am | 0.00076000 | n |
4 | 0.000262000 | 0.000258000 | 15/12/2010 9:31:33 am | 0.00082000 | n |
4 | 0.000094000 | 0.000093000 | 15/12/2010 9:30:55 am | 0.00073200 | n |
3 | 0.000051000 | 0.000051000 | 15/12/2010 9:30:54 am | 0.00061300 | n |
4 | 0.000014000 | 0.000014000 | 15/12/2010 9:30:53 am | 0.00078600 | n |
4 | 0.000014000 | 0.000014000 | 15/12/2010 9:30:51 am | 0.00077100 | n |
4 | 0.000282000 | 0.000278000 | 15/12/2010 9:30:49 am | 0.00081600 | n |
4 | 0.000014000 | 0.000014000 | 15/12/2010 9:30:48 am | 0.00074700 | n |
4 | 0.000014000 | 0.000014000 | 15/12/2010 9:30:47 am | 0.00074000 | n |
4 | 0.000014000 | 0.000014000 | 15/12/2010 9:30:46 am | 0.00080700 | n |
4 | 0.000020000 | 0.000020000 | 15/12/2010 9:30:45 am | 0.00084900 | n |
I am probably going about this all wrong, but so far it seems to be working. Let me know if there is a better way.