I’ve been interested in Fuzzy Logic and Artificial Neural Networks (ANN) for some time now. I guess what caught my attention is the idea that binary doesn’t do a very good job of describing a world full of shades of gray.
I’ve owned a book on the topic for a while now: C++ Neural Networks and Fuzzy Logic by Valluru B. Rao. Well, it’s been a really long time, as the examples in the book are provided on a floppy disk.
I’ve not had much chance to apply the concepts for a few reasons. Most websites don’t need AI capabilities. There haven’t been any ‘easy’ PHP libraries to use. I briefly looked into porting the C++ code provided in the book over, but I am too lazy and my math skills aren’t good enough to make sure I did it right. Also, I couldn’t figure out how to take the ‘stuff’ I dealt with every day and boil it down to numerical values that an ANN can process.
I recently stumbled across a new, 100% PHP-based Neural Network library called NeuralMesh.
My ‘Ah-ha’ moment was when I realized that I could use the Bayesian Statistical Inferencing stuff I have been using for SPAM filtering email as the numerical input into a Neural Net.
Now that I have a tool that can take a block of text and turn it into a number, I can play with Neural nets. After all, I’ve been creating a ‘SPAM score’ for email messages for about a year now, so I got the numericalization of text down pretty well.
I have been having trouble with my spam filter lately. I’ve relied on it too much. I’ve categorized over 13,000 email messages. When a spammer changes tactics and tries something new, I can end up marking dozens of the new messages as spam before the Bayesian library overcomes all the inertia and poisoning from its existing training and starts recognizing them as spam on its own.
I need to change tactics. The Bayesian library I am using, b8, is optimized for very short messages, like blog post comment spam, and not for potentially large email messages. My new tack will be to feed b8 what it’s designed for: smaller chunks of text. I will give it the email subject line both in pieces and in whole. Scoring the whole subject will allow it to quickly categorize messages it has seen before, which means fast squashing of duplicate messages. Scoring the subject in pieces will allow it to quickly pick up on the old ‘random bit of text’ in an otherwise rehashed subject trick. A few bits and pieces of the email header, like the IP address, user agent, spam-cop score, etc., will also be picked out and fed into b8 for individual scoring.
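To make the chunking idea concrete, here is a rough sketch of pulling those small pieces out of a raw message. It is written in Python with only the standard library, purely for illustration; it is not b8 code, and the function name `extract_chunks`, the dictionary keys, and the choice to grab the first IPv4 address from the `Received` headers are all my own assumptions. The spam-cop score is left out because I am not assuming which header it would live in.

```python
import re
from email import message_from_string

# Loose IPv4 pattern; good enough for a sketch, it does not validate octet ranges.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def extract_chunks(raw_email):
    """Split one raw email into small text chunks that can be scored separately.

    Hypothetical helper: each returned chunk would be handed to the
    Bayesian filter on its own instead of scoring the whole message at once.
    """
    msg = message_from_string(raw_email)
    subject = (msg.get("Subject", "") or "").strip()

    # All Received headers joined together; take the first IPv4 address seen.
    received = " ".join(msg.get_all("Received") or [])
    ip_match = IPV4_PATTERN.search(received)

    return {
        "subject_whole": subject,            # catches exact repeats quickly
        "subject_pieces": subject.split(),   # catches the 'random bit of text' trick
        "user_agent": (msg.get("User-Agent", "") or "").strip(),
        "ip": ip_match.group(0) if ip_match else "",
    }
```

Each value in the returned dictionary would get its own spam score, so one message yields several numbers instead of a single one.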
This is where NeuralMesh comes in. I can feed all these different spam scores into the Neural Network as separate inputs and let the ANN figure out what scores are meaningful and what combinations are not. There is always a human that decides if any given email is truly a good email or if it is spam, so I have a built in feedback loop. This should allow the ANN to quickly learn all it needs to know about spam.
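As a very small illustration of "several scores in, one verdict out", here is a sketch in plain Python. It is a single logistic neuron, which is the simplest possible stand-in for a network and not NeuralMesh or its API; a real ANN with a hidden layer could also learn combinations of scores, which this toy cannot. The class name `SingleNeuron`, the learning rate, and the example scores are all assumptions made up for the sketch.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

class SingleNeuron:
    """One logistic unit: a weighted sum of the input scores, then a sigmoid."""

    def __init__(self, n_inputs, learning_rate=0.5):
        self.weights = [0.0] * n_inputs
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, scores):
        """Return a value between 0 (not spam) and 1 (spam)."""
        total = self.bias + sum(w * s for w, s in zip(self.weights, scores))
        return sigmoid(total)

    def train(self, scores, is_spam, iterations=1):
        """Nudge the weights toward the human's verdict, `iterations` times."""
        target = 1.0 if is_spam else 0.0
        for _ in range(iterations):
            error = self.predict(scores) - target
            # Gradient step for the logistic (cross-entropy) loss.
            self.weights = [
                w - self.learning_rate * error * s
                for w, s in zip(self.weights, scores)
            ]
            self.bias -= self.learning_rate * error

# Inputs are per-chunk spam scores, e.g. [whole subject, subject pieces, IP].
net = SingleNeuron(n_inputs=3)
for _ in range(200):
    net.train([0.90, 0.80, 0.95], is_spam=True)
    net.train([0.10, 0.20, 0.05], is_spam=False)
```

The human verdict is the `is_spam` argument, which is the built-in feedback loop: every confirmed decision becomes one more training example.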
Unfortunately, I can’t just jump right in and expect good results. The first thing I have to do is create the multiple scores and let the Bayesian Inferencing get ‘learned up’ on what’s spam and what’s not. I will continue to rely on the current, overworked filter while the new set is watching over its shoulder and learning.
If I don’t do this, I will be feeding the ANN uneducated guesses from the new Naive Bayesian filters, which means the first few ANN learning sessions would be training on what is basically gibberish. At my email usage rate, after a week or two b8 will have things pretty much figured out and I can start training the ANN with meaningful numbers.
One of the things I will be doing with the ANN is adaptive training. Each time a human makes or confirms a spam/not spam decision, I feed that back into the ANN for further training. When you present training data to an ANN, you tell it to train for X number of iterations until it has figured things out. Because my spam to not-spam ratio is so out of proportion, the training data is very heavily biased towards spam. So, the less spammy a message is, the more iterations I allow the ANN to train on it. Conversely, the more spammy it is, the fewer training iterations a message gets, to the point that if it’s 99.5% or more likely to be spam, I don’t train the ANN on it at all. That would just be telling it what it already knows.
Now, this process doesn’t apply to just spam. By training b8 on different categories for a given text string, it can do a lot of different categorizations for me.
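Here is a sketch of how that iteration schedule could look, in Python for brevity. The only number that comes from my setup is the 99.5% cutoff. The linear scaling, the `max_iterations` default of 50, and the function name `training_iterations` are assumptions picked for illustration.

```python
def training_iterations(spam_probability, max_iterations=50, skip_threshold=0.995):
    """Decide how many training iterations one message should get.

    spam_probability: estimated likelihood (0.0 to 1.0) that the message is spam.
    Returns 0 when the message is so obviously spam that training on it
    would teach the network nothing new.
    """
    if not 0.0 <= spam_probability <= 1.0:
        raise ValueError("spam_probability must be between 0 and 1")

    # 99.5% or more likely to be spam: skip training entirely.
    if spam_probability >= skip_threshold:
        return 0

    # Assumed linear schedule: the less spammy, the more iterations.
    # Anything below the cutoff still gets at least one iteration.
    return max(1, round(max_iterations * (1.0 - spam_probability)))
```

With these assumed defaults, a clearly legitimate message gets the full 50 iterations, a borderline one gets about half that, and obvious spam gets none, which keeps the flood of spam from drowning out the rare good messages.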
In practice, it appears to be working as I have planned. Time will tell if this is a good approach or not.