Eternity Belongs to Science: August 2011

Monday, August 22, 2011

Tip: C\C++ Logical Error on Division

Logical errors are the worst nightmare, even for a senior programmer. As experience is gained such issues are getting more and more familiar to recognize and avoid them. Though tips should be always welcomed, especially for the newcomers.

Enough with the much talk! take good look at this:

void main() { float n=0; n=7/3; printf("%f", n); }

So what the printf() you think would print?

The most of you expect something like 2.333..

Though theory is pretty strict!

Your Co. Sci. Teacher:

"You may not have a float result by dividing two integers, not even when the result is stored in a float variable."

and so you should have printed: 2

The reason of this failure lays to your integer/integer division.

So in order to achieve a float result, you should write:

void main() { float n=0; n=7.0/3.0; printf("%f", n); }

Always note:

int/int == int
float/int == float
int/float == float
float/float == float

Have fun! :)

Tuesday, August 2, 2011

Hopfield nets can reconstruct distorted patterns and undo noise

Summary: Hopfield nets are one of the easier neural net models to understand -- and they can be useful, too. The main ability of the Hopfield net is to undo noise and reconstruct known patterns. Python programmer Andrew Blais is your guide to learning more about Hopfield nets, and exploring his net.py application.

Hot things cool, obviously. The house gets messy, frustratingly. In much the same way, messages are distorted. Short-term strategies for reversing these things include, respectively, reheating, cleaning, and the Hopfield net. This article introduces you to the last of these three, an algorithm that can, within certain parameters, undo noise. A very simple Python implementation, net.py, will show you how its basic parts fit together, and why a Hopfield net can sometimes retrieve a pattern from its distortion. This implementation, while limited, will still enable you to perform a number of instructive and revealing experiments on the Hopfield net.

What is your quest?

I assume that you are reading this because you have some computational problem. You have been advised that some neural net algorithm might provide a solution. Specifically, the suggestion is that you might employ a Hopfield net. I suppose further that you need a precis of the idea so you can decide whether the suggestion is feasible and warrants more exploration. The following broadly sketched application of the Hopfield net may get you started solving your problem.

Your problem begins with a base set of patterns that are encoded with -1s and +1s. They can be encoded in 0s and +1s, if necessary. These patterns could be normalized binary images of postage stamps (see Resources). The next element is a set of patterns that deviate from this base. Your quest is to create code that inputs a deviant pattern and outputs just one base pattern. Your quest then would be an algorithm that can input an encoded representation of a particular stamp, and output just one base stamp pattern. You're not searching for certainty. There is an acceptable failure rate that does not adversely affect your enterprise. For you, there is a rate at which stamps can be misrecognized that doesn't significantly interfere with your project.

If this reminds you of your problem, the following might be the beginning of the development of your solution. By the end, you should be better able to answer the basic questions. What is this Hopfield thing? How does it work? What are its limitations? What can it do for me? Do I want to spend more time investigating this?

Patterns and their distortion

Let's begin with five arbitrary patterns patterns that will be distorted and subsequently retrieved. Visually, they can be represented as ten-by-ten matrices of black and white squares. Figure 1 shows the first pattern, p1.

Figure 1. Visual representation of p1

Clicking on any of p2 through p5 in net.py displays the other patterns. For coding purposes, these five patterns are initially represented as Python lists. So, the first pattern, for example, is represented in Listing 1.

Listing 1. Pattern p1

p1 = [ [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1], [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1], [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1], [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1], [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1], [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1], [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1], [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1], [-1, -1, -1, -1, -1, -1, -1, -1, 1, 1] ]

Black and white squares correspond to -1 and +1, respectively. Such a list is subsequently converted into an array. (SeeResources for references to the Python libraries that I utilized.) For each element in such a pattern, a -1 or +1, there is a node object in an array of nodes.

A node object has three central attributes:

A node object has a value that is an element in the pattern.
A node also has an address, that is, the address of its position in an array.
Each node also has a color that makes it possible to display it.

As noted, one function of a Hopfield net is to undo noise. To implement this, there needs to be a way to introduce patter into a pattern. Clicking Add Noise does precisely this. Adding noise to p1 produced Figure 2.

Figure 2. A distortion of p1
Visual representation of p1

To introduce noise into a pattern, the Hopfield net visits each address in the array of nodes. It then picks a random number in [0,1), that is, in the interval from zero inclusive to one exclusive. If the number is below some fixed level, the net changes the node's value and color, else it leaves it undisturbed. By default, this level is set to 0.20, and so there is a twenty percent probability that any given node will change its value and color. You can use the slider to vary this probability. At zero, there is no noise, and at one hundred, the node array is simply inverted. In between, there are all the other usual suspects. Each introduces a specific degree of noise into a pattern. Because the Hopfield net is an algorithm for undoing noise, it is able to input a distorted pattern such as the one in Figure 2 and then output its original in Figure 1.

Although sometimes obscured by bad exposition, the pertinent algorithms are fairly simple to implement. In what follows, I'll walk you through an implementation of the algorithm, and then I'll explain briefly why these algorithms can undo noise.

Weights

As David Mertz and I wrote in a previous developerWorks article, An introduction to neural nets, the human brain consists of about one hundred billion neurons, and each is, on average, connected to several thousand other neurons. Neurons both send and receive varying quantities of energy. An important feature of neurons is that they don't react immediately to the reception of energy. Instead, they sum their received energies, and they send their own quantities of energy to other neurons only when this sum has reached a certain critical threshold.

When the brain is learning, it is tempting to say that it is adjusting the number and strength of these connections. There is no doubt that this is an extreme simplification of the biological facts. Simplifications can be useful, especially when they serve as a model, in this case, for implementing cybernetic neural nets. The transition from the biological to the algorithmic is achieved by translating connections into weights. (A perceptron utilizes weights in a different and perhaps more intuitive way. Before you read on here, you might want to review An introduction to neural nets.

A weight object primarily encapsulates a numerical value that represents the weight between one node and another. A weight object also possesses an address and a color. The address is its location in the weight array. The color is for visualization. Figure 3 is one possible visual representation of a weight array. Net.py (see Resources for a link) keeps track of the lowest and greatest weights, and it will display a key to the numerical values of the colors in the weight display.

Figure 3. One visual representation of a weight array

On each row of the weight array, there is a list of all the weights between one given node and all the others. The Hopfield net has two forms. One form of node has a weight to itself, and the other doesn't. Experimenting with this via net.py shows that when nodes aren't self-weighted, node arrays don't always reconstruct to themselves. Select the No Self Weight option, and try reconstructing p3 or p5. There are one hundred nodes, and so there are ten thousand often redundant weights. When nodes are self-weighted, which is the default, then there are 5050 non-redundant weights, else there are just 4950.

Figure 4. The genesis of a weight

Listing 2. Weight generation algorithm

PAT = { x: x is a RxC pattern } WA = { x: x is a (R*C)x(R*C) weight array }  For all (i,j) and (a,b) in the range of R and C: SUM = 0 For p in PAT: SUM += p(i,j) * p(a,b) WA( (R*i)+j, (C*a)+b ) = SUM

The biologically inspired concept that underlies the Hopfield net was explored in 1949 by Donald Hebb. He conjectured that if a pair of nodes are simultaneously sending their energy to one another, then the weight between them is greater than if just one sent its energy. He wrote, "When an axon of cell A is near enough to excite B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased" (see Resources for details). In terms of the Hopfield net, when a pair of nodes has the same value, in other words, -1 or +1, the weight between them is stronger. The sum of the products of the values of all possible node pairs determines the content of the weight array. When they are the same, their product is positive and the sum is increased. In the case of different values, this sum is decreased.

Where, more specifically, do weights come from? To begin with, the Hopfield net must have access to a library or set of basic patterns. Here, these are p1 through p5. A weight's genesis begins with the Hopfield net choosing a pair of coordinates within the bounds of a basic pattern matrix. It then visits the corresponding nodes in each pattern. At each stop, it augments an ongoing sum with the product of the values of the nodes (see Figure 4). When the net has visited each pattern, it sets the value of a weight object to this sum. Given a pair of nodes at (i,j) and (a,b), it sets the value of the weight object at (i*10+j,a*10+b) in the weight array.

That is how the array of weights is constructed, but how does it function in the larger Hopfield algorithm? How does it function in pattern reconstruction?

Reconstruction

With a weight array and a distorted or noisy pattern in hand, the Hopfield net can sometimes output the original pattern. There are no guarantees, but the net is right a surprising percent of the time. It can do this either synchronously or asynchronously.

To do this asynchronously, the net traverses the distorted pattern, and at each node, N, it asks whether N's value should be set to -1 or +1.

To ascertain this, the net traverses the row in the weight array that contains all of the weights between N and the other nodes. Keep in mind that nodes may or may not be self-weighted.

At each stop in this second traversal, it forms the product of (1) the weight between N and the other node and (2) the value of the other node. As you'd expect, the net keeps an ongoing tally of these products.

Now, the net is in a position to decide. In the present implementation at least, if this sum is less than zero, the net sets the node to -1, and if it is greater than or equal to zero, the net sets the node to +1.

Figure 5. Reconstruction: No node left behind

Listing 3. Reconstruction

For every node, N, in pattern P. SUM = 0 For every node, A, in P: W = weight between N and A V = value of A SUM += W * V if SUM < 0: set N's value to -1 else: set N's value to +1

The default update is asynchronous, because the net sets a node's value just after it decides what this value should be. It would be synchronous if the net set the values of the nodes after it had made all of its decisions. In this case, it would store its decisions, and then update the array of nodes after its last decision. In net.py (see Resources), reconstruction is, by default, asynchronous, but note the option for synchronous reconstruction.

When you experiment with net.py, and when reconstruction is successful, the Hopfield net's behavior startles. One such behavior is that even when the weight array is severely degraded, it can reconstruct patterns. My simple implementation's Degrade Weights traverses the weight array and randomly sets weights to zero. Displaying the weight array provides one view of the extent of the damage. Correct reconstruction illustrates here that a Hopfield net can be far more fault-tolerant than the brain. How does it work? The mathematical story is not short. Here instead is a plot summary.

What goes on

The algorithmic details of the Hopfield net suggest why it can sometimes undo noise. As is often true in algorithm analysis, the devil is in the mathematical details. In the present case, these are difficult to picture or imagine. There are, fortunately, some closely related phenomena whose analysis sheds light on the workings of the Hopfield net.

When a marble drops into a simply curved bowl, it rolls to the lowest point. The bowl's curvature is like a rule that inputs the marble's entry point and returns the lowest point, the bottom of the bowl. A more complicated curvature would be like a function that inputs an entry point and returns one of several local lowest points.

Energy is an essential part of these simple phenomena. In both the simple and complicated cases, the entering marble has a certain degree of energy. Over time, this energy decreases. Eventually, it reaches a stable state in which it can't diminish any further. In the complicated case, there may be a lower energy level, but the marble can't reach it.

Analogously, a pattern, distorted or not, can be thought of as having a certain degree of energy. So, patterns p1 through p5 have energy levels.

Pattern energy levels

Pattern	Energy	Rank
p1	-9600.0	1
p2	-7104.0	3
p3	-6848.0	4
p4	-9472.0	2
p5	-6408.0	5

The computation of a pattern's energy level is not complicated. For every possible pair of nodes, the Hopfield net calculates the product of their values and the weight between them. The pattern's energy level is the result of dividing the sum of these products by negative two. Net.py displays the energy level of any given pattern or node array. As you reconstruct a pattern, I expect and hope that you will see the pattern's energy level decrease.

In reconstruction, the net's decision to flip a node is based on the sum of the products of the values of the other nodes and the weights between them. When the sum is less than zero, the node is set to -1, else it is set to +1. When the product of value and weight is positive, it contributes to pushing the sum above zero. But this pushes the net in the direction of deciding to set the node's value to +1. When this product is negative, the sum is pushed to or below zero. So, the net is pushed in the direction of setting the node's value to -1. The variation in weights introduces variation into the degree to and direction in which the net is pushed in making its decision. A pattern can be so distorted that the net isn't pushed in the direction of correct decisions. Positively put, the net can be pushed to a correct decision an impressively large percent of the time.

If you reconstruct any of the five patterns, you will see that each reconstructs to itself. This is as it should be, since each pattern already occupies a local lowest energy point. There is no reconstructive process that can lower the pattern's energy level. If you succeed in reconstructing a distorted pattern, the Hopfield has reduced the pattern's energy level to the level of one of the patterns. When it fails, it has reduced the distorted pattern's energy level to a spurious local low point. In either case, the energy level cannot be reduced any further. In has, in other words, reached a stable state. Describing the Hopfield net in terms of energy has an interesting and important consequence. On this basis, it can be mathematically established that repeated application of the reconstruction algorithm will eventually result in a stable pattern. (See Resources for details.)

Summary

You should be aware of the limitations on Hopfield nets. An obvious already-cited limitation is that its patterns must be encodable in an array of either -1s and +1s, or 0s and +1s. You already know that a Hopfield can stabilize at a spurious local low. This more significant limitation is that the probability that the net will stabilize at some spurious local low increases when the number of patterns exceeds approximately 14 percent of the number of nodes in the node array. In other words, for each additional base pattern, there must be about 7 more nodes. Despite such limitations, the pattern reconstruction discussed here may serve as an illustrative guide to resolving your particular computing problem. You now have the precis of the Hopfield algorithm initially promised. Should it address your requirements, you now know the superstructure of forging your own implementations. This includes the algorithms for computing a weight array, the way to reconstruct distorted patterns, and the algorithm for computing a pattern's energy level.

Resources

Download the Net.py source code from SourceForge.
Andrew implemented the Hopfield net in Python using the Tkinter and Numeric libraries. Python lets you focus on algorithms by removing the need to focus on arcane and Byzantine syntax.
Andrew Blais and David Mertz wrote an introductory article, An introduction to neural networks (developerWorks, July 2001), that focuses on the perceptron. David W. Clark's paper of the same name includes a table comparing neural networks with digital computers. See also the Wikipedia Neural network article.
Consult Attrasoft on stamp classification and stamp recognition. For fingerprint recognition, see the design study by Nick D'Arcy. For the use of artificial intelligence in business (PDF document), see the study by Tomas Nordlander.
Neural Networks Documentation from Wolfram Research gives background on Hopfield and Hopfield nets; clicking through to Section 9.2 at the end of the document (and then clicking the Next button) leads you to an interesting discussion of using Hopfield for character recognition. Or just see the Hopfield Net entry in An Idiot's Guide to Neural Networks.
The biologically inspired concept that underlies the Hopfield net was explored in 1949 by Donald Hebb. See also The Hebb Legacy (Canadian Journal of Experimental Psychology). The study of Donald Hebb's The Organization of Behavior(Wiley, 1949) repays the required heavy lifting.
You can read JJ Hopfield's 1982 paper, Neural networks and physical systems with emergent collective computational abilities (Proceedings of the National Academy of Sciences of the United States of America, 1982) online in PDF format. You can also see what interests Professor Hopfield today.
Begin your excursion into the mathematics of the Hopfield net with these resources:
- Kevin Gurney's excellent An Introduction to Neural Networks (Routledge, 1997) focuses on building the intuitions that underlay the mathematics. See Chapter 7 for a discussion of the proof that iterated application of the reconstruction algorithm results in a stable pattern. Also see Neural Nets for an early online version.
- Neural Computing: An Introduction (Hilger, 1990), by R. Beal and T. Jackson, is also very good at developing your sense of what is going on in neural nets. In particular, see Chapter 6.

Pattern learning with the back-propagation algorithm

Summary: Neural nets may be the future of computing. A good way to understand them is with a puzzle that neural nets can be used to solve. Suppose that you are given 500 characters of code that you know to be C, C++, Java, or Python. Now, construct a program that identifies the code's language. One solution is to construct a neural net that learns to identify these languages. This article discusses the basic features of neural nets and approaches to constructing them so you can apply them in your own coding.

According to a simplified account, the human brain consists of about ten billion neurons -- and a neuron is, on average, connected to several thousand other neurons. By way of these connections, neurons both send and receive varying quantities of energy. One very important feature of neurons is that they don't react immediately to the reception of energy. Instead, they sum their received energies, and they send their own quantities of energy to other neurons only when this sum has reached a certain critical threshold. The brain learns by adjusting the number and strength of these connections. Even though this picture is a simplification of the biological facts, it is sufficiently powerful to serve as a model for the neural net.

Threshold logic units (TLUs)

The first step toward understanding neural nets is to abstract from the biological neuron, and to focus on its character as athreshold logic unit (TLU). A TLU is an object that inputs an array of weighted quantities, sums them, and if this sum meets or surpasses some threshold, outputs a quantity. Let's label these features. First, there are the inputs and their weights: X₁,X₂, ..., X_nand W₁, W₂, ...,W_n. Then, there are the X_i*W_i that are summed, which yields the activation level a, in other words:

Â a = (X1 * W1)+(X2 * W2)+...+(Xi * Wi)+...+ (Xn * Wn)

The threshold is called theta. Lastly, there is the output: y. When a >= theta, y=1, else y=0. Notice that the output doesn't need to be discontinuous, since it could also be determined by a squashing function, s (or sigma), whose argument is a, and whose value is between 0 and 1. Then, y=s(a).

Figure 1. Threshold logic unit, with sigma function (top) and cutoff function (bottom)
Threshhold Logic Unit

A TLU can classify. Imagine a TLU that has two inputs, whose weights equal 1, and whose theta equals 1.5. When this TLU inputs <0,0>, <0,1>, <1,0>, and <1,1>, it outputs 0, 0, 0, and 1 respectively. This TLU classifies these inputs into two groups: the 1 group and the 0 group. Insofar as a human brain that knows about logical conjunction (Boolean AND) would similarly classify logically conjoined sentences, this TLU knows something like logical conjunction.

This TLU has a geometric interpretation that clarifies what is happening. Its four possible inputs correspond to four points on a Cartesian graph. From X₁*W₁+ X₂*W₂ = theta, in other words, the point at which the TLU switches its classificatory behavior, it follows that X₂ = -X₁ + 1.5. The graph of this equation cuts the four possible inputs into two spaces that correspond to the TLU's classifications. This is an instance of a more general principle about TLUs. In the case of a TLU with an arbitrary number of inputs, N, the set of possible inputs corresponds to a set of points in N-dimensional space. If these points can be cut by a hyperplane -- in other words, an N-dimensional geometric figure corresponding to the line in the above example -- then there is a set of weights and a threshold that define a TLU whose classifications match this cut.

How a TLU learns

Since TLUs can classify, they know stuff. Neural nets are also supposed to learn. Their learning mechanism is modeled on the brain's adjustments of its neural connections. A TLU learns by changing its weights and threshold. Actually, the weight-threshold distinction is somewhat arbitrary from a mathematical point of view. Recall that the critical point at which a TLU outputs 1 instead of 0 is when the SUM(Xi * Wi) >= theta. This is equivalent to saying that the critical point is when the SUM(X_i * W_i)+ (-1 * theta) >= 0. So, it is possible to treat -1 as a constant input whose weight, theta, is adjusted in learning, or, to use the technical term,training. In this case, y=1 when SUM(X_i * W_i)+ (-1 * theta) >= 0, else y=0.

During training, a neural net inputs:

A series of examples of the items to be classified
Their proper classifications or targets

Such input can be viewed as a vector: <X₁, X₂, ...,X_n, theta, t>, where t is the target or true classification. The neural net uses these to modify its weights, and it aims to match its classifications with the targets in the training set. More precisely, this is supervised training, as opposed to unsupervised training. The former is based on examples accompanied by targets, whereas the latter is based on statistical analysis (see Resources later in this article). Weight modification follows a learning rule. One idealized learning algorithm goes something like this:

Listing 1. Idealized learning algorithm

fully_trained = FALSE DO UNTIL (fully_trained): fully_trained = TRUE FOR EACH training_vector = <X1, X2, ..., Xn, theta, target>:: # Weights compared to theta a = (X1 * W1)+(X2 * W2)+...+(Xn * Wn) - theta y = sigma(a) IF y != target: fully_trained = FALSE FOR EACH Wi: MODIFY_WEIGHT(Wi)                  # According to the training rule IF (fully_trained): BREAK

You're probably wondering, "What training rule?" There are many, but one plausible rule is based on the idea that weight and threshold modification should be determined by a fraction of (t - y). This is accomplished by introducing alpha (0 < alpha < 1), which is called the learning rate. The change in W_i equals (alpha * (t - y) * Xi). When alpha is close to 0, the neural net will engage in more conservative weight modifications, and when it is close to 1, it will make more radical weight modifications. A neural net that uses this rule is known as a perceptron, and this rule is called the perceptron learning rule. One result about perceptrons, due to Rosenblatt, 1962 (see Resources), is that if a set of points in N-space is cut by a hyperplane, then the application of the perceptron training algorithm will eventually result in a weight distribution that defines a TLU whose hyperplane makes the wanted cut. Of course, to recall Keynes, eventually, we're all dead, but more than computational time is at stake, since we may want our neural net to make more than one cut in the space of possible inputs.

Our initial puzzle illustrates this. Suppose that you were given N characters of code that you knew to be either C, C++, Java, or Python. The puzzle is to construct a program that identifies the code's language. A TLU that could do this would need to make more than one cut in the space of possible inputs. It would need to cut this space into four regions, one for each language. This would be accomplished by training a neural net to make two cuts. One cut would divide the C/C++ from the Java/Python, and the other cut would divide the C/Java from the C++/Python. A net that could make these cuts could also identify the language of a source code sample. However, this requires our net to have a different structure. Before we describe this difference, let's briefly turn to practical considerations.

Figure 2. Preliminary (and insufficient) Perceptron Learning Model

Considerations of computational time rule out taking N-character chunks of code, counting the frequency of the visible ASCII characters, 32 through 127, and training our neural net on the basis of these counts and target information about the code's language. Our way around this was to limit our character counts to the twenty most frequently occurring non-alphanumeric characters in a pool of C, C++, Java, and Python code. For reasons that concern the implementation of floating point arithmetic, we decided to train our net with these twenty counts divided by a normalizing factor. Clearly, one structural difference is that our net has twenty input nodes, but this should not be surprising, since our description has already suggested this possibility. A more interesting difference is a pair of intermediary nodes, N1 and N2, and an increase in the number of output nodes from two to four, O1-O4.

N1 would be trained so that upon seeing C or C++, it would set y1=1, and upon seeing Java or Python, it would set y1=0. N2 would be similarly trained. Upon seeing C or Java, N2 would set y2=1, and upon seeing C++ or Python, it would set y2=0. Furthermore, N1 and N2 would output 1 or 0 to the Oi. Now, if N1 sees C or C++, and N2 sees C or Java, then the code in question is C. Moreover, if N1 sees C or C++, and N2 sees C++ or Python, the code is C++. The pattern is obvious. So, suppose that the Oi were trained to output 1 or 0 on the basis of the following table:

Intermediate nodes mapped to output (as Boolean functions)

N1	N2	O1 (C)	O2 (C++)	O3 (Java)	O4 (Python)
0	0	0	0	0	1
0	1	0	0	1	0
1	0	0	1	0	0
1	1	1	0	0	0

If this worked, our net could identify the language of a code sample. It is a pretty idea, and its practical ramifications seem to be incredibly far reaching. However, this solution presupposes that the C/C++ and the Java/Python inputs are cut by one hyperplane, and that the C/Java and the C++/Python inputs are cut by the other. There is an approach to net training that bypasses this kind of assumption about the input space.

The delta rule

Another training rule is the delta rule. The perceptron training rule is based on the idea that weight modification is best determined by some fraction of the difference between target and output. The delta rule is based on the idea of gradient descent. This difficult mathematical concept has a prosaic illustration. From some given point, a Southward path may be steeper than an Eastward path. Walking East may take you off a cliff, while walking South may only take you along its gently sloping edge. West would take you up a steep hill, and North leads to level ground. All you want is a leisurely walk, so you seek ground where the overall steepness of your options is minimized. Similarly, in weight modification, a neural net can seek a weight distribution that minimizes error.

Limiting ourselves to nets with no hidden nodes, but possibly having more than one output node, let p be an element in a training set, and t(p,n) be the corresponding target of output node n. However, let y(p,n) be determined by the squashing function, s, mentioned above, where a(p,n) is n's activation relative to p, y(p,n) = s( a(p,n) ) or the squashed activation of node n relative to p. Setting the weights (each Wi) for a net also sets the difference between t(p,n) and y(p,n) for every p and n, and this means setting the net's overall error for every p. Therefore, for any set of weights, there is an average error. However, the delta rule rests on refinements in the notions of average and error. Instead of discussing the minutiae, we shall just say that the error relative to some p and n is: ½ * square( t(p,n) - y(p,n) ) (see Resources). Now, for each Wi, there is an average error defined as:

Listing 2: Finding the average error

sum = 0 FOR p = 1 TO M:         # M is number of training vectors FOR n = 1 TO N:     # N is number of output nodes sum = sum + (1/2 * (t(p,n)-y(p,n))^2) average = 1/M * sum

The delta rule is defined in terms of this definition of error. Because error is explained in terms of all the training vectors, the delta rule is an algorithm for taking a particular set of weights and a particular vector, and yielding weight changes that would take the neural net on the path to minimal error. We won't discuss the calculus underpinning this algorithm. Suffice it to say that the change to any Wi is:

Â alpha * s'(a(p,n)) * (t(p,n) - y(p,n)) * X(p,i,n).

X(p,i,n) is the i^th element in p that is input into node n, and alpha is the already noted learning rate. Finally, s'( a(p,n) ) is the rate of change (derivative) of the squashing function at the activation of the n^th node relative to p. This is the delta rule, and Widrow and Stearns (see Resources) showed that when alpha is sufficiently small, the weight vector approaches a vector that minimizes error. A delta rule-based algorithm for weight modification looks like this.

Downward slope (follow until error is suitably small)

step 1: for each training vector, p, find a(p) step 2: for each i, change Wi by: alpha * s'(a(p,n)) * (t(p,n)-y(p,n)) * X(p,i,n)

There are important differences from the perceptron algorithm. Clearly, there are quite different analyses underlying the weight change formulae. The delta rule algorithm always makes a change in weights, and it is based on activation as opposed to output. Lastly, it isn't clear how this applies to nets with hidden nodes.

Back-propagation

Back-propagation is an algorithm that extends the analysis that underpins the delta rule to neural nets with hidden nodes. To see the problem, imagine that Bob tells Alice a story, and then Alice tells Ted. Ted checks the facts, and finds that the story is erroneous. Now, Ted needs to find out how much of the error is due to Bob and how much to Alice. When output nodes take their inputs from hidden nodes, and the net finds that it is in error, its weight adjustments require an algorithm that will pick out how much the various nodes contributed to its overall error. The net needs to ask, "Who led me astray? By how much? And, how do I fix this?" What's a net to do?

Figure 3. "Code Recognizer" back-propagation neural network

The back-propagation algorithm also rests on the idea of gradient descent, and so the only change in the analysis of weight modification concerns the difference between t(p,n) and y(p,n). Generally, the change to W_i is:

Â alpha * s'(a(p,n)) * d(n) * X(p,i,n)

where d(n) for a hidden node n, turns on (1) how much n influences any given output node; and (2) how much that output node itself influences the overall error of the net. On the one hand, the more n influences an output node, the more n affects the net's overall error. On the other hand, if the output node influences the overall error less, then n's influence correspondingly diminishes. Where d(j) is output node j's contribution to the net's overall error, and W(n,j) is the influence that n has on j, d(j) * W(n,j) is the combination of these two influences. However, n almost always influences more than one output node, and it may influence every output node. So, d(n) is:

Â SUM(d(j)*W(n,j)), for all j

where j is an output node that takes input from n. Putting this together gives us a training rule. First part: the weight change between hidden and output nodes, n and j, is:

Â alpha * s'(a(p,n))*(t(p,n) - y(p,n)) * X(p,n,j)

Second part: the weight change between input and output nodes, i and n, is:

Â alpha * s'(a(p,n)) * sum(d(j) * W(n,j)) * X(p,i,n)

where j varies over all the output nodes that receive input from n. Moreover, the basic outline of a back-propagation algorithm runs like this.

Initialize Wi to small random values.

Steps to follow until error is suitably small

Step 1: Input training vector.
Step 2: Hidden nodes calculate their outputs.
Step 3: Output nodes calculate their outputs on the basis of Step 2.
Step 4: Calculate the differences between the results of Step 3 and targets.
Step 5: Apply the first part of the training rule using the results of Step 4.
Step 6: For each hidden node, n, calculate d(n).
Step 7: Apply the second part of the training rule using the results of Step 6.

Steps 1 through 3 are often called the forward pass, and steps 4 through 7 are often called the backward pass. Hence, the name: back-propagation.

Recognizing success

With the back-propagation algorithm in hand, we can turn to our puzzle of identifying the language of source code samples. To do this, our solution extends Neil Schemenauer's Python module bpnn (see Resources). Using his module is amazingly simple. We customized the class NN in our class NN2, but our changes only modify the presentation and output of the process, nothing algorithmic. The basic code looks like:

Listing 3: Setting up a neural network with bpnn.py

# Create the network (number of input, hidden, and training nodes) net = NN2(INPUTS, HIDDEN, OUTPUTS)   # create the training and testing data trainpat = []   testpat = []   for n in xrange(TRAINSIZE+TESTSIZE):   #... add vectors to each set  # train it with some patterns net.train(trainpat, iterations=ITERATIONS, N=LEARNRATE, M=MOMENTUM)    # test it  net.test(testpat)    # report trained weights  net.weights()

Of course, we need input data. Our utility code2data.py provides this. Its interface is straightforward: just put a bunch of source files with various extensions into a subdirectory called ./code , and then run the utility listing these extensions as options, for example:

Â Â Â python code2data.py py c java

What you get is a bunch of vectors on STDOUT , which you can pipe to another process or redirect to a file.Â This output looks something like:

Listing 4: Code2Data output vectors

0.15 0.01 0.01 0.04 0.07 0.00 0.00 0.03 0.01 0.00 0.00 0.00 0.05 0.00 > 1 0 0 0.14 0.00 0.00 0.05 0.13 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.13 0.00 > 1 0 0 [...]

Recall that the input values are normalized numbers of occurrences of various special characters. The target values (after the greater than sign) are YES/NO representing the type of source code file containing these characters. But there is nothing very obvious about what is what. That's the great thing, the data could by anything that you can generate input and targets for.

The next step is to run the actual code_recognizer.py program. This takes (on STDIN ) a collection of vectors like those above. The program has a wrapper that deduces how many input nodes (count and target) are needed, based on the actual input file. Choosing the number of hidden nodes is trickier. For source code recognition, 6 to 8 hidden nodes seem to work very well. You can override all the parameters on the command line, if you want to experiment with the net to discover how it does with various options -- each run might take a while, though. It is worth noting that code_recognizer.py sends its (large) test result file to STDOUT , but puts some friendly messages on STDERR . So most of the time, you will want to direct STDOUT to a file for safe keeping, and watch STDERR for progress and result summaries.

Listing 5: Running code_recognizer.py

> code2data.py py c java | code_recognizer.py > test_results.txt  Total bytes of py-source: 457729  Total bytes of c-source: 245197  Total bytes of java-source: 709858  Input set: ) ( _ . = ; " , ' * / { } : - 0 + 1 [ ]  HIDDEN = 8  LEARNRATE = 0.5  ITERATIONS = 1000  TRAINSIZE = 500  OUTPUTS = 3  MOMENTUM = 0.1  ERROR_CUTOFF = 0.01  TESTSIZE = 500  INPUTS = 20  error -> 95.519... 23.696... 19.727... 14.012... 11.058... 9.652...   8.858... 8.236... 7.637... 7.065... 6.398... 5.413... 4.508...   3.860... 3.523... 3.258... 3.026... 2.818... 2.631... 2.463...   2.313... 2.180... 2.065... 1.965... 1.877... 1.798... 1.725...   [...]  0.113... 0.110... 0.108... 0.106... 0.104... 0.102... 0.100...   0.098... 0.096... 0.094... 0.093... 0.091... 0.089... 0.088...   0.086... 0.085... 0.084...  Success rate against test data: 92.60%

The decreasing error is a nice reminder, and acts as a sort of progress meter during long runs. But, the final result is what is impressive. The net does a quite respectable job, in our opinion, of recognizing code -- we would love to hear how it does on your data vectors.

Source:

http://www.ibm.com/developerworks/library/l-neural/

Eternity Belongs to Science

Pages