Tuning the training rate of a NeuralMesh ANN

I’ve been running a NeuralMesh Artificial Neural Network for about a week now. Its job is spam detection. I am feeding it 7 different spam scores from the b8 Bayesian library.

My 7 ‘aspects’ of email spam are:

  • to
  • from
  • full from
  • subject
  • full subject
  • header
  • body

I am using the To field because sometimes spam is sent to funny names, and sometimes real email is sent to a particular person. The From score works on both the email address and the name it comes from – luckily some spammers put their product names in the From field. The Full From score processes the From field as a whole, which should be good for duplicate messages – when I’ve seen it once before, I have a good idea of what it is. The Subject and Full Subject scores operate on the same principle; however, sometimes a random word is put into a subject, so the Full Subject isn’t necessarily a solid repeat indicator. The Header score is designed to look for SPF records and IP addresses in addition to including the above bits. The Body, well, that’s just the whole message.

I feed all these different scores into a 7 input, 5 neuron, 1 output ANN to let it sort out what’s meaningful and what’s not. Every time somebody deletes a message as spam or accepts an email as not spam, I use that as a training trigger for both the b8 Bayesian library and the NeuralMesh ANN library. Because my spam to not-spam ratio is so out of whack, the training loop doesn’t need many iterations for spam, but needs a lot for not spam.
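To make the 7 input, 5 neuron, 1 output shape concrete, here is a rough sketch of a single forward pass through a network of that size. This is a generic feed-forward illustration, not NeuralMesh’s actual API: the function names, the `$net` array layout, the sigmoid activation and the all-zero weights are all assumptions made up for the example.

```php
<?php
// Illustrative 7-5-1 feed-forward pass. NOT NeuralMesh's API; every name
// and the weight layout below are assumptions for this sketch.

function sigmoid(float $x): float
{
    return 1.0 / (1.0 + exp(-$x));
}

/**
 * One fully connected layer: each row of $weights holds one neuron's
 * input weights, and $biases holds one bias per neuron.
 */
function layer_forward(array $inputs, array $weights, array $biases): array
{
    $outputs = [];
    foreach ($weights as $n => $neuronWeights) {
        $sum = $biases[$n];
        foreach ($neuronWeights as $i => $w) {
            $sum += $w * $inputs[$i];
        }
        $outputs[] = sigmoid($sum);
    }
    return $outputs;
}

/** Runs 7 scores through 5 hidden neurons down to a single spam estimate. */
function ann_score(array $scores, array $net): float
{
    $hidden = layer_forward($scores, $net['hidden_w'], $net['hidden_b']);
    $output = layer_forward($hidden, $net['output_w'], $net['output_b']);
    return $output[0];
}

// Placeholder weights (all zero), so the untrained output is exactly 0.5.
$net = [
    'hidden_w' => array_fill(0, 5, array_fill(0, 7, 0.0)),
    'hidden_b' => array_fill(0, 5, 0.0),
    'output_w' => [array_fill(0, 5, 0.0)],
    'output_b' => [0.0],
];

// Made-up scores, in the order: to, from, full from, subject,
// full subject, header, body.
$scores = [0.91, 0.88, 0.99, 0.75, 0.99, 0.60, 0.97];
echo ann_score($scores, $net), "\n";
```

Training is what moves those weights away from zero so that the inputs that matter end up with more pull on the output than the ones that don’t.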

To address this, I am using an adaptive training iteration system. The closer the system’s score is to pure spam, the fewer training iterations I run; the further it is from pure spam, the more iterations I run.

$learning_iterations = round((1 - $finalscore['AI Score']) * 100);
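As a quick worked example, the same formula can be wrapped in a function and fed a few scores. The function name, the clamp and the sample scores are my own additions for illustration; the sketch assumes the AI Score is a float between 0 and 1, where 1 means pure spam.

```php
<?php
// Assumes the AI score is a float in [0, 1], where 1.0 means "pure spam".
// The function name and the clamp are illustrative additions.
function learning_iterations(float $aiScore): int
{
    // Clamp so an out-of-range score cannot produce a negative count.
    $aiScore = max(0.0, min(1.0, $aiScore));
    return (int) round((1 - $aiScore) * 100);
}

// Made-up sample scores, from very spammy to not very spammy.
foreach ([0.97, 0.81, 0.34] as $score) {
    printf("score %.2f -> %d iterations\n", $score, learning_iterations($score));
}
```

A score of 0.97 works out to 3 iterations and a score of 0.34 to 66, which is the same range of iteration counts that shows up in the table.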

This seems to be working well, judging by the table below of the most recent training sessions the ANN has undergone. The fewer the iterations, the higher the spam score, and spam is what the system has seen a lot of. Comparing the Start MSE to the End MSE, any training under 4 iterations doesn’t really change anything. The sessions with higher iteration counts are not spam, so the system starts with a very high Start MSE, and the extra training iterations bring the End MSE right down.

You can see that after every non-spam training session, it takes a couple of data sets before it settles back down to its usual boring 3s and 4s for training. The data corpus is still pretty small, with only a few non-spam data points trained into the system so far.

I think I am going to skip all training under 3 iterations, as it doesn’t seem to make any real difference. It probably just loads the system up with extra data points, and will eventually slow it down as I accumulate more data. I lose a lot of the little finesse trainings, but I think having fewer, more meaningful trainings as the corpus drifts will be more resource friendly.

| Iterations | Start MSE | End MSE | Date | Exec Time | Off-line |
|---|---|---|---|---|---|
| 66 | 0.978118000 | 0.000000000 | 16/12/2010 9:33:15 am | 0.01188200 | n |
| 3 | 0.000087000 | 0.000087000 | 16/12/2010 8:45:08 am | 0.00063600 | n |
| 3 | 0.000108000 | 0.000108000 | 16/12/2010 8:44:43 am | 0.00059300 | n |
| 3 | 0.000098000 | 0.000098000 | 16/12/2010 8:44:41 am | 0.00056200 | n |
| 3 | 0.000083000 | 0.000083000 | 16/12/2010 8:44:40 am | 0.00061200 | n |
| 3 | 0.000107000 | 0.000107000 | 16/12/2010 8:44:39 am | 0.00062100 | n |
| 4 | 0.000100000 | 0.000100000 | 16/12/2010 8:44:37 am | 0.00078800 | n |
| 3 | 0.000078000 | 0.000078000 | 16/12/2010 8:44:36 am | 0.00061500 | n |
| 3 | 0.000077000 | 0.000077000 | 16/12/2010 8:44:35 am | 0.00069400 | n |
| 4 | 0.000113000 | 0.000113000 | 16/12/2010 8:44:33 am | 0.00076700 | n |
| 3 | 0.000114000 | 0.000114000 | 16/12/2010 8:44:31 am | 0.00060800 | n |
| 3 | 0.000069000 | 0.000068000 | 16/12/2010 8:44:30 am | 0.00061600 | n |
| 4 | 0.000105000 | 0.000104000 | 16/12/2010 8:44:09 am | 0.00075900 | n |
| 4 | 0.000101000 | 0.000101000 | 16/12/2010 8:44:08 am | 0.00075800 | n |
| 4 | 0.000118000 | 0.000118000 | 16/12/2010 8:44:06 am | 0.00075200 | n |
| 3 | 0.000119000 | 0.000119000 | 16/12/2010 8:44:04 am | 0.00062000 | n |
| 4 | 0.000066000 | 0.000065000 | 16/12/2010 8:44:04 am | 0.00086800 | n |
| 3 | 0.000119000 | 0.000119000 | 16/12/2010 8:44:01 am | 0.00061100 | n |
| 3 | 0.000120000 | 0.000120000 | 16/12/2010 8:44:01 am | 0.00063000 | n |
| 4 | 0.000123000 | 0.000123000 | 16/12/2010 8:43:58 am | 0.00086400 | n |
| 4 | 0.000112000 | 0.000112000 | 16/12/2010 8:43:30 am | 0.00080700 | n |
| 3 | 0.000113000 | 0.000113000 | 16/12/2010 8:43:29 am | 0.00058800 | n |
| 4 | 0.000114000 | 0.000114000 | 16/12/2010 8:43:28 am | 0.00076300 | n |
| 3 | 0.000128000 | 0.000128000 | 16/12/2010 8:43:27 am | 0.00060900 | n |
| 4 | 0.000129000 | 0.000129000 | 16/12/2010 8:43:26 am | 0.00088600 | n |
| 3 | 0.000105000 | 0.000105000 | 16/12/2010 8:43:24 am | 0.00058300 | n |
| 4 | 0.000131000 | 0.000130000 | 16/12/2010 8:43:23 am | 0.00076200 | n |
| 4 | 0.000082000 | 0.000082000 | 16/12/2010 8:43:21 am | 0.00075600 | n |
| 3 | 0.000134000 | 0.000134000 | 16/12/2010 8:43:20 am | 0.00061800 | n |
| 4 | 0.000122000 | 0.000122000 | 16/12/2010 8:43:04 am | 0.00075100 | n |
| 4 | 0.000137000 | 0.000136000 | 16/12/2010 8:43:02 am | 0.00079800 | n |
| 4 | 0.000139000 | 0.000138000 | 16/12/2010 8:43:02 am | 0.00086800 | n |
| 4 | 0.000140000 | 0.000139000 | 16/12/2010 8:43:00 am | 0.00073600 | n |
| 4 | 0.000142000 | 0.000141000 | 16/12/2010 8:42:59 am | 0.00075300 | n |
| 4 | 0.000128000 | 0.000128000 | 16/12/2010 8:42:59 am | 0.00076200 | n |
| 4 | 0.000099000 | 0.000098000 | 16/12/2010 8:42:57 am | 0.00075400 | n |
| 4 | 0.000114000 | 0.000113000 | 16/12/2010 8:42:17 am | 0.00084600 | n |
| 4 | 0.000147000 | 0.000146000 | 16/12/2010 8:39:23 am | 0.00076900 | n |
| 4 | 0.000149000 | 0.000148000 | 16/12/2010 8:39:20 am | 0.00079300 | n |
| 4 | 0.000136000 | 0.000135000 | 16/12/2010 8:39:18 am | 0.00081300 | n |
| 4 | 0.000138000 | 0.000137000 | 16/12/2010 8:39:15 am | 0.00075000 | n |
| 4 | 0.000158000 | 0.000157000 | 16/12/2010 8:39:13 am | 0.00079700 | n |
| 4 | 0.000155000 | 0.000154000 | 16/12/2010 8:39:13 am | 0.00074500 | n |
| 8 | 0.000291000 | 0.000160000 | 16/12/2010 8:39:12 am | 0.00150400 | n |
| 19 | 0.000667000 | 0.000524000 | 16/12/2010 8:37:45 am | 0.00386300 | n |
| 21 | 0.002455000 | 0.000771000 | 16/12/2010 8:37:43 am | 0.00371700 | n |
| 84 | 0.075070000 | 0.000000000 | 16/12/2010 8:35:36 am | 0.01497400 | n |
| 5 | 0.000146000 | 0.000144000 | 16/12/2010 8:34:20 am | 0.00097200 | n |
| 5 | 0.000148000 | 0.000147000 | 16/12/2010 8:34:18 am | 0.00094400 | n |
| 5 | 0.000151000 | 0.000149000 | 16/12/2010 8:34:17 am | 0.00095600 | n |
| 5 | 0.000105000 | 0.000104000 | 16/12/2010 8:34:14 am | 0.00096000 | n |
| 9 | 0.000123000 | 0.000121000 | 16/12/2010 8:34:03 am | 0.00164900 | n |
| 9 | 0.000126000 | 0.000124000 | 16/12/2010 8:30:34 am | 0.00171600 | n |
| 6 | 0.000129000 | 0.000127000 | 16/12/2010 8:30:13 am | 0.00114000 | n |
| 5 | 0.000159000 | 0.000157000 | 16/12/2010 8:30:12 am | 0.00092800 | n |
| 6 | 0.000169000 | 0.000167000 | 16/12/2010 8:30:11 am | 0.00115800 | n |
| 6 | 0.000172000 | 0.000169000 | 16/12/2010 8:29:32 am | 0.00231000 | n |
| 6 | 0.000150000 | 0.000148000 | 16/12/2010 8:29:29 am | 0.00116100 | n |
| 6 | 0.000192000 | 0.000189000 | 16/12/2010 8:29:27 am | 0.00110500 | n |
| 6 | 0.000105000 | 0.000104000 | 16/12/2010 8:29:17 am | 0.00114200 | n |
| 8 | 0.000183000 | 0.000178000 | 16/12/2010 8:29:02 am | 0.00164100 | n |
| 6 | 0.000195000 | 0.000191000 | 16/12/2010 8:28:14 am | 0.00116900 | n |
| 5 | 0.000218000 | 0.000215000 | 16/12/2010 8:26:38 am | 0.00098300 | n |
| 6 | 0.000200000 | 0.000197000 | 16/12/2010 8:26:13 am | 0.00114300 | n |
| 10 | 0.000167000 | 0.000161000 | 16/12/2010 8:25:59 am | 0.00187000 | n |
| 6 | 0.000229000 | 0.000216000 | 16/12/2010 8:22:49 am | 0.00110500 | n |
| 2 | 0.004528000 | 0.004528000 | 16/12/2010 8:22:44 am | 0.00034600 | n |
| 7 | 0.000176000 | 0.000172000 | 16/12/2010 8:22:41 am | 0.00128300 | n |
| 8 | 0.000269000 | 0.000259000 | 16/12/2010 8:19:45 am | 0.00146800 | n |
| 11 | 0.000334000 | 0.000311000 | 15/12/2010 5:19:48 pm | 0.00199000 | n |
| 10 | 0.000340000 | 0.000319000 | 15/12/2010 4:10:16 pm | 0.00186500 | n |
| 12 | 0.000869000 | 0.000527000 | 15/12/2010 2:47:10 pm | 0.00221200 | n |
| 8 | 0.000471000 | 0.000441000 | 15/12/2010 2:46:57 pm | 0.00165100 | n |
| 12 | 0.000552000 | 0.000488000 | 15/12/2010 1:14:04 pm | 0.00221000 | n |
| 68 | 0.008667000 | 0.000560000 | 15/12/2010 1:12:59 pm | 0.01240600 | n |
| 30 | 0.976014000 | 0.000002000 | 15/12/2010 1:12:19 pm | 0.00550100 | n |
| 4 | 0.000014000 | 0.000014000 | 15/12/2010 1:11:02 pm | 0.00086600 | n |
| 4 | 0.000371000 | 0.000366000 | 15/12/2010 11:13:43 am | 0.00089800 | n |
| 4 | 0.000013000 | 0.000013000 | 15/12/2010 9:32:22 am | 0.00078800 | n |
| 4 | 0.000022000 | 0.000022000 | 15/12/2010 9:32:10 am | 0.00073700 | n |
| 4 | 0.000247000 | 0.000243000 | 15/12/2010 9:32:09 am | 0.00079800 | n |
| 3 | 0.000052000 | 0.000052000 | 15/12/2010 9:32:08 am | 0.00057900 | n |
| 4 | 0.000034000 | 0.000034000 | 15/12/2010 9:32:07 am | 0.00079400 | n |
| 3 | 0.000154000 | 0.000153000 | 15/12/2010 9:32:06 am | 0.00068800 | n |
| 4 | 0.000019000 | 0.000019000 | 15/12/2010 9:32:04 am | 0.00081700 | n |
| 4 | 0.000017000 | 0.000017000 | 15/12/2010 9:32:03 am | 0.00090600 | n |
| 4 | 0.000014000 | 0.000014000 | 15/12/2010 9:32:01 am | 0.00079600 | n |
| 4 | 0.000045000 | 0.000045000 | 15/12/2010 9:31:59 am | 0.00079000 | n |
| 4 | 0.000032000 | 0.000032000 | 15/12/2010 9:31:58 am | 0.00074200 | n |
| 4 | 0.000019000 | 0.000019000 | 15/12/2010 9:31:57 am | 0.00076000 | n |
| 4 | 0.000262000 | 0.000258000 | 15/12/2010 9:31:33 am | 0.00082000 | n |
| 4 | 0.000094000 | 0.000093000 | 15/12/2010 9:30:55 am | 0.00073200 | n |
| 3 | 0.000051000 | 0.000051000 | 15/12/2010 9:30:54 am | 0.00061300 | n |
| 4 | 0.000014000 | 0.000014000 | 15/12/2010 9:30:53 am | 0.00078600 | n |
| 4 | 0.000014000 | 0.000014000 | 15/12/2010 9:30:51 am | 0.00077100 | n |
| 4 | 0.000282000 | 0.000278000 | 15/12/2010 9:30:49 am | 0.00081600 | n |
| 4 | 0.000014000 | 0.000014000 | 15/12/2010 9:30:48 am | 0.00074700 | n |
| 4 | 0.000014000 | 0.000014000 | 15/12/2010 9:30:47 am | 0.00074000 | n |
| 4 | 0.000014000 | 0.000014000 | 15/12/2010 9:30:46 am | 0.00080700 | n |
| 4 | 0.000020000 | 0.000020000 | 15/12/2010 9:30:45 am | 0.00084900 | n |
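One way to sanity-check the idea of skipping the tiny training sessions is to look at how much of the starting error each session actually removes. The sketch below does that for a handful of rows copied from the table; the function name and the choice of rows are mine, and this is only a rough way of eyeballing the data, not a proper evaluation.

```php
<?php
/** Fraction of the starting error removed by one training session. */
function mse_improvement(float $startMse, float $endMse): float
{
    if ($startMse <= 0.0) {
        return 0.0; // nothing left to improve
    }
    return ($startMse - $endMse) / $startMse;
}

// A few rows copied from the table: [iterations, Start MSE, End MSE].
$rows = [
    [3, 0.000087, 0.000087],
    [4, 0.000105, 0.000104],
    [8, 0.000291, 0.000160],
    [21, 0.002455, 0.000771],
    [84, 0.075070, 0.000000],
];

foreach ($rows as [$iterations, $start, $end]) {
    printf(
        "%3d iterations: %5.1f%% of the error removed\n",
        $iterations,
        100 * mse_improvement($start, $end)
    );
}
```

The 3 and 4 iteration rows barely move the error at all, while the long sessions remove most or all of it, which matches what the table looks like by eye.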

I am probably going about this all wrong, but so far it seems to be working. Let me know if there is a better way.

Artificial Neural Networks in PHP using NeuralMesh with Bayesian Inferencing using b8

I’ve been interested in Fuzzy Logic and Artificial Neural Networks (ANN) for some time now. I guess it’s the whole ‘binary doesn’t describe our world full of shades of gray very well’ thing that caught my attention.

I’ve owned a book on the topic for a while now: C++ Neural Networks and Fuzzy Logic by Valluru B. Rao. Well, it’s been a really long time, as the examples in the book are provided on a floppy disk.

I’ve not had much chance to apply the concepts for a few reasons. Most websites don’t need AI capabilities. There haven’t been any ‘easy’ PHP libraries to use. I briefly looked into porting the C++ code provided in the book over, but I am too lazy and my math skills aren’t good enough to make sure I did it right. Also, I couldn’t figure out how to take the ‘stuff’ I dealt with every day and boil it down to numerical values that an ANN can process.

I recently stumbled across a new 100% PHP based Neural Network library called NeuralMesh. http://neuralmesh.com/ http://sourceforge.net/projects/neuralmesh/

My ‘Ah-ha’ moment was when I realized that I could use the Bayesian Statistical Inferencing stuff I have been using for SPAM filtering email as the numerical input into a Neural Net.

Now that I have a tool that can take a block of text and turn it into a number, I can play with Neural nets. After all, I’ve been creating a ‘SPAM score’ for email messages for about a year now, so I got the numericalization of text down pretty well.

I have been having trouble with my spam filter lately. I’ve relied on it too much. I’ve categorized over 13,000 email messages. When a spammer changes tactics and tries something new, it can take dozens of rounds of marking the new messages as spam before the Bayesian library overcomes all the inertia and poisoning and recognizes them as spam.

I need to change tactics. The Bayesian library I am using, b8, is optimized for very short messages, like blog post comment spam, and not potentially large email messages. My new tack will be to feed b8 what it’s designed for: smaller chunks of text. I will give it the email subject line both in pieces and in whole. The whole subject scoring will allow it to quickly categorize messages it has seen before – fast squashing of duplicate messages. The subject pieces scoring will allow it to quickly pick up on the old ‘random bit of text’ in an otherwise rehashed subject trick. A few bits and pieces of the email header, like the IP address, user agent, spam-cop score, etc., will all be picked out and fed into b8 for individual scoring as well.
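As a rough sketch of the ‘in pieces and in whole’ idea, here is one way a subject line could be turned into both views before scoring. The function name, the lowercasing and the whitespace handling are my own assumptions for illustration; how each view actually gets fed to b8 is left out, since that depends on the library’s own tokenizing.

```php
<?php
/**
 * Splits a subject line into two views: the whole normalised line,
 * and its individual words. Normalisation rules are assumptions.
 */
function subject_views(string $subject): array
{
    // Collapse runs of whitespace and lowercase so trivial variations match.
    $whole = strtolower(trim(preg_replace('/\s+/', ' ', $subject)));
    $pieces = $whole === '' ? [] : explode(' ', $whole);
    return ['whole' => $whole, 'pieces' => $pieces];
}

// Made-up subject with a random token stuffed into the middle.
$views = subject_views("Cheap  Meds xk42q  Today");
print_r($views);
```

With a subject like this, the random token changes the whole-line view from one message to the next, but the remaining pieces stay the same, which is the reason for scoring both.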

This is where NeuralMesh comes in. I can feed all these different spam scores into the Neural Network as separate inputs and let the ANN figure out what scores are meaningful and what combinations are not. There is always a human that decides if any given email is truly a good email or if it is spam, so I have a built in feedback loop. This should allow the ANN to quickly learn all it needs to know about spam.

Unfortunately, I can’t just jump right in and expect good results. The first thing I have to do is create the multiple scores and let the Bayesian Inferencing get ‘learned up’ on what’s spam and what’s not. I will continue to rely on the current, overworked filter while the new set is watching over its shoulder and learning.

If I don’t do this, I will be feeding the ANN uneducated guesses from the new Naive Bayesian filters, which are basically gibberish for the first few ANN learning sessions. At my email usage rate, after a week or two b8 will have things pretty much figured out and I can start training the ANN with meaningful numbers.

One of the things I will be doing with the ANN is adaptive training. Each time a human makes or confirms a spam/not spam decision, I feed that back into the ANN for further training. When you present training data to an ANN, you tell it to train using X number of iterations until it has figured things out. Because my spam to not-spam ratio is so out of proportion, the ANN is very heavily biased towards spam. So, the less spammy a message is, the more iterations I allow the ANN to train on it. Conversely, the more spammy it is, the fewer training iterations a message gets – to the point that if it’s 99.5% or more likely to be spam, I don’t train the ANN at all. It’s just telling it what it already knows.
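The feedback loop described above could be sketched roughly like this. The `TrainableNet` interface, the function names and the logging stub are hypothetical stand-ins invented for the example; neither NeuralMesh nor b8 is claimed to expose anything that looks like this. The sketch also assumes the score is a float between 0 and 1.

```php
<?php
// Hypothetical stand-in for whatever ANN library is in use.
interface TrainableNet
{
    /** Trains toward $target for $iterations passes. */
    public function train(array $inputs, float $target, int $iterations): void;
}

const SKIP_ABOVE = 0.995; // "99.5% or more likely to be spam"

/** Fewer iterations the spammier the score; none at all above the cutoff. */
function iterations_for(float $annScore): int
{
    if ($annScore >= SKIP_ABOVE) {
        return 0;
    }
    return (int) round((1 - $annScore) * 100);
}

/** Called whenever a human makes or confirms a spam / not spam decision. */
function on_human_decision(TrainableNet $net, array $inputs, float $annScore, bool $isSpam): int
{
    $iterations = iterations_for($annScore);
    if ($iterations > 0) {
        $net->train($inputs, $isSpam ? 1.0 : 0.0, $iterations);
    }
    return $iterations;
}

// Stub network that only logs what it was asked to do.
$stub = new class implements TrainableNet {
    public function train(array $inputs, float $target, int $iterations): void
    {
        printf("train toward %.1f for %d iterations\n", $target, $iterations);
    }
};

on_human_decision($stub, [0.99, 0.98], 0.999, true);  // skipped, no training
on_human_decision($stub, [0.10, 0.25], 0.20, false);  // long training session
```

One thing this literal reading does not handle is a message that scores 99.5% or higher but gets marked as not spam by the human. That is probably the one case where skipping training is the wrong call, so a real version might only apply the cutoff when the human agrees it is spam.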

Now, this process doesn’t apply to just spam. By training b8 different categories for a given text string, it can do a lot of different categorizations for me.

In practice, it appears to be working as I have planned. Time will tell if this is a good approach or not.

Blog Updates

It’s amazing how time flies. It seems like I don’t blog anymore, even though I think about doing it a lot.

WordPress has finally been updated to the latest version. I was leery of doing it at first; I have so many goofy plugins installed that I was afraid they were not going to work. They all seem to, so that was a waste of a perfectly good worry.

My tweets finally work correctly. I use the Automatic Category Excluder plugin, which I had set up incorrectly, so it was squashing my tweet posts. I had to turn ALL my plugins off, and turn them back on one by one until I figured out which plugin was doing it. Then it took just a few seconds to see what was wrong. The posts are hidden on the home page and RSS feed; I figure if people are coming to my blog, they don’t want to see my twitter chitter unless they followed me on Twitter. But, being that it’s the bulk of my content these days, it is visible on all other sections of the site.

Android Crash and Burn

So, today, I was catching up on my web-development news, and saw that an Android version of Google Reader had come out. So, I was trying to install that at the exact same time I got a phone call.

Now, my phone’s been running slow for a while now. I have it full to the top with all sorts of useful apps.

Well, the phone must have decided enough’s enough. It slowed down right to a crawl. I told it to power off, as many times this helps a little.

It never restarted.

Plus, I start getting tons of error emails from my webserver. So, I work on figuring out why my webserver is freaking out, while watching my phone reboot, over and over and over again. This means I am freaking out. My phone’s been bricked! Turns out the server thing was something trivial. I tweaked the logging code so it is more self evident why it’s going nuts, and finished up my work day.

I tried to reboot the phone, I tried to soft-reset the phone. I ended up hard-resetting the phone. This sucks.

Well, it had been getting pretty slow as of late.

The phone comes back up with a clean slate and just doesn’t want to work. I set it up on my WiFi.  It can’t seem to connect to anything, no authentication to my Google account, nothing. I cross my eyes and reboot the phone. It comes up and seems to work better now.

I delete all the crap apps off my desktop that Sprint put on there. (Why can’t I DELETE these dumb things off my phone? I don’t have room for them on my device!)

I open up the Marketplace app, and to my immense joy, it remembers all my installed apps! So, I get to installing all my ‘necessary’ apps. The ones that make my phone useful to me.

  • Google Maps (update)
  • Street View for Google Maps
  • Google Reader (what I was trying to install when this all started)
  • Square
  • iTriage Mobile Health
  • Handcent SMS
  • Lookout Mobile Security
  • Where’s My Droid
  • First Aid
  • Twitter
  • Radar Now!
  • Google Voice (totally killer app for the android)
  • MomentFlash
  • PdaNet
  • SMS Backup
  • Wifi Analyzer
  • Barcode Scanner
  • Facebook for Android
  • The Weather Channel (update, I wish I could uninstall this)
  • Compass

The amazing thing is that it took only 20-30 minutes for all my contacts to be back on the phone, my core applications that I use every day re-installed, and my life back. Sure, I have some more time getting things tweaked just right, like the background screen. But that’s trivial. I don’t feel like I am missing my right arm or something, just a finger or two. This is how setting up a new phone (or rebuilding a crashed one) ought to work. Automagically.

Google Android is a flipping awesome OS as far as I am concerned.

Insulation Prep Work

Before we bought the house, the REALTOR called Nicor and found out what the gas bill was for the prior year: $99 a month. That’s about $1,200 to heat the house for the year – it was basically vacant but heated for a year before we moved in. A quick peek up the attic access showed that the insulation was basically missing.

My Attic

The fix is to insulate the attic of course. Not as simple as it sounds…

The insulation is a mess!

The bathroom fan vented into the attic. Well, into a rafter in the attic, really.

The bathroom fan vented into the rafter. I added some ducting to vent it elsewhere.
The bathroom now vents out towards the gable end attic vent.

There was some jacked up electrical up there and a recessed light I needed to shield from the insulation.

Wiring Mess

While I was WAY in the back finishing up the bathroom vent, I just about fell into a hole in my attic. No, I didn’t fall and make a hole in my ceiling, I found a hole in my attic. I know, I thought the same thing, who keeps a hole in their attic? Well, I have one. The front entry is lower than the rest of the house, so the attic drops down at that point.

Hole in the attic
A man-trap in my attic? No, just a lower section of ceiling on the main floor.

It's a big hole in my attic
I insulated the exterior walls and the 'floor' to keep heat in the house.
I cut a few pieces of scrap wood.
And made a floor for my attic to cover the hole. I will blow insulation over the top of this.

The next step is to actually blow in the insulation. That is NOT going to be fun in this short attic.
