MACHINE LEARNING
          
          
          6.2 DECISION TREES AND THE ID3 ALGORITHM
          
           The Iterative Dichotomizer 3 (ID3) algorithm is an inductive learning
           algorithm developed by Quinlan ([24]), which induces generalized
           knowledge from a set of training instances. The derived knowledge is
           expressed in the form of classification (decision) trees. Variations of ID3
           have been used successfully in many applications, including learning the
           knowledge bases of systems that learn chess end games and predict soybean
           disease classifications. Many of the first commercial automated knowledge
           acquisition systems, such as VP-Expert, Xi Plus, 1st-Class, ExpertEase and
           Knowledge Quest, were based on variations of the ID3 algorithm.
          
          
          6.2.1 The ID3 algorithm
          
           Given a goal attribute (an attribute for which we want to learn how the other
           attributes affect its value), the ID3 algorithm constructs a classification tree
           in a recursive, top-down, divide-and-conquer fashion, as follows:
          
               Begin with the whole set of instance vectors at the root node and
          progressively separate them into subsets, one subset for each value in the
          domain of the chosen "branching" attribute, in such a way that all instances in
          the same subset share the same value of the "branching" attribute. Each
          generated subset forms a child node. If the instances contained in a node do
          not share the same value for the goal attribute, the node is expanded further
          by using another branching attribute, otherwise it becomes a leaf. The process
          terminates when there are no nodes that can be expanded further. 
          
               To keep the branching factor of the classification tree manageable, a
          continuous or multivalued discrete domain is usually divided into a small
          number of subdomains. For example, if the age of a company in the data set
          ranges between one and ninety years, we may want to convert it into three
          subranges, old, average and new, where "new" is any age less than ten years,
          "old" any age greater than 60 years and "average" ranges in between.
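The subrange conversion described above can be sketched as a small helper; the thresholds (10 and 60 years) are the ones from the text, while the function name is of course arbitrary:

```python
def discretize_age(age):
    """Map a company's age in years onto the three subranges from the
    text: "new" below 10 years, "old" above 60, "average" in between."""
    if age < 10:
        return "new"
    elif age > 60:
        return "old"
    else:
        return "average"

print([discretize_age(a) for a in (3, 25, 75)])  # ['new', 'average', 'old']
```

Any fixed convention for the boundary ages (exactly 10 or 60 years) would do, as long as it is applied consistently over the data set.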
          
               We give below an example of how a classification tree would be
          constructed for the following sample loan approval database.
          
          
          
           LOAN     INCOME     HOME-OWNER     DEBT
           ---------------------------------------
           no       low        no             high
           no       low        no             low
           no       low        yes            high
           no       average    yes            high
           no       average    no             low
           no       average    no             high
           yes      average    yes            low
           no       high       no             low
           no       high       no             high
           yes      high       yes            high
          
          
          
                Suppose we want to determine whether an individual's loan application
           should be approved, based on the determining attributes income, home
           ownership and debt. By constructing a knowledge base of rules whose
           hypothesis is the goal attribute "loan" and whose premises are conditions
           on the other attributes, a "profile" of the successful applicant is constructed.
          
                Of course, statistical techniques could be used, or a neural net could
           be trained on the data, to predict the eligibility of a given applicant. Such
           an approach, however, lacks explanation facilities: it provides no deeper
           understanding of the underlying relations between the attributes and how they
           affect each other, nor does it allow a recommendation to approve or reject
           the loan to be justified to the applicant.
          
               Let us now apply the ID3 classification tree process on the database
          given above. Suppose that we first choose the attribute debt as the branching
          attribute. Based on this attribute, we split the set of data tuples into two
          groups, one with "debt" equal to high and one with "debt" equal to low.
          
               Both child nodes need now to be split, since there is no consensus on
          loan approval being yes or no in either leaf. If income is chosen next as the
          branching attribute, we get the second level of the tree given below. Four out
          of the six leaf nodes in the tree given below, need not be expanded further, as
          the vectors in the subsets corresponding to each node share the same value for
          the goal attribute.
          
          
          
           Debt
           |-- high
           |     |-- Income = low:     (no,low,no,high), (no,low,yes,high)
           |     |-- Income = average: (no,average,yes,high), (no,average,no,high)
           |     `-- Income = high:    (no,high,no,high), (yes,high,yes,high)
           `-- low
                 |-- Income = low:     (no,low,no,low)
                 |-- Income = average: (no,average,no,low), (yes,average,yes,low)
                 `-- Income = high:    (no,high,no,low)
          
          
          If "home-owner" is chosen to expand the remaining two leaf nodes, the
          following tree will be derived:
          
          
          
           Debt
           |-- high
           |     |-- Income = low:     (no,low,no,high), (no,low,yes,high)
           |     |-- Income = average: (no,average,yes,high), (no,average,no,high)
           |     `-- Income = high
           |           |-- Home-owner = yes: (yes,high,yes,high)
           |           `-- Home-owner = no:  (no,high,no,high)
           `-- low
                 |-- Income = low:     (no,low,no,low)
                 |-- Income = average
                 |     |-- Home-owner = yes: (yes,average,yes,low)
                 |     `-- Home-owner = no:  (no,average,no,low)
                 `-- Income = high:    (no,high,no,low)
          
               Each path from the root to a leaf corresponds to a rule. The rules
          derived from the tree given above are as follows:
          
               loan(no):-debt(high),income(low).
               loan(no):-debt(high),income(average).
               loan(yes):-debt(high),income(high),home_owner(yes).
               loan(no):-debt(high),income(high),home_owner(no).
               loan(yes):-debt(low),income(average),home_owner(yes).
               loan(no):-debt(low),income(average),home_owner(no).
               loan(no):-debt(low),income(low).
               loan(no):-debt(low),income(high).
          
               It should be clear at this point that the choice of the branching
          attributes affects the resulting decision tree. For example, if we choose to first
          branch on "debt" followed by "home-owner" the following decision tree is
          derived:
          
          
           Debt
           |-- high
           |     |-- Home-owner = yes
           |     |     |-- Income = low:     (no,low,yes,high)
           |     |     |-- Income = average: (no,average,yes,high)
           |     |     `-- Income = high:    (yes,high,yes,high)
           |     `-- Home-owner = no: (no,high,no,high), (no,average,no,high),
           |                          (no,low,no,high)
           `-- low
                 |-- Home-owner = yes: (yes,average,yes,low)
                 `-- Home-owner = no:  (no,average,no,low), (no,high,no,low),
                                       (no,low,no,low)
               
          This tree corresponds to the rule set:
          
               loan(no):-debt(high),home_owner(yes),income(average).
               loan(no):-debt(high),home_owner(yes),income(low).
               loan(yes):-debt(high),home_owner(yes),income(high).
               loan(no):-debt(high),home_owner(no).
               loan(yes):-debt(low),home_owner(yes).
               loan(no):-debt(low),home_owner(no).
               
                Both rule sets classify the vectors in the data set with 100 percent
           accuracy, but in the second set the rules are shorter, and therefore more
           general, and there are fewer of them. The question that now arises is how
           to determine the order in which to consider the branching attributes, so
           that the derived tree has the smallest possible depth and branching factor.
           The brute-force approach of finding the optimum tree, by producing the trees
           for all possible permutations of branching attributes and selecting the "best,"
           causes combinatorial explosion; the problem of optimizing a decision tree is
           NP-complete. The ID3 algorithm proposes a heuristic solution to this
           problem: the information-theoretic measure of entropy is used to select
           branching attributes that are likely to lead to a small tree.
          
               The concept of entropy is introduced in the following section. 
          
          
          6.2.2 Entropy
          
          Given a completely expanded classification tree and a set of attribute values
          corresponding to a path from the root to a leaf, the "uncertainty" as to the
          value of the goal attribute has been reduced to zero. When no attribute values
          are known for an instance we have maximum uncertainty about its
          classification, while the more attribute values are known the more the
          uncertainty is reduced. In order to derive a classification tree of the smallest
          depth, a greedy heuristic can be employed where the attributes which cause
          the greatest reduction in uncertainty of classification are used first to expand
          the tree. ID3 uses the concept of entropy as a quantitative measure of the
          "uncertainty" of classification induced by an attribute. 
          
               The uncertainty, as to which of several events is going to occur (for
          example, whether the loan application will be approved or not), is a function
          of the probabilities of those events. In what follows, let us assume that
          H(P1,P2,...,Pm) is such an uncertainty function, where Pi's are the probabilities
          of m events.
          
          An uncertainty measure must satisfy the following requirements:
          
          1.   If the number of equiprobable choices for the outcome of an
               experiment is increased, then the uncertainty about the 
               outcome is increased. For example, 
                H(1/2,1/2) is less than H(1/4,1/4,1/4,1/4)
          2.   The uncertainty as to the joint occurrence of two independent 
               events equals the sum of the uncertainties as to their 
               individual occurrences.
          3.   The uncertainty measure must be a continuous function of the 
               probabilities involved. In other words, a small change in 
               the probabilities must be associated with a small change in     
               uncertainty.
          
                It has been proven that, up to the choice of a constant factor and
                logarithm base, there is only one function which satisfies the given
                conditions, namely the entropy.
          
                Given m possible outcomes of an experiment, having respective
           probabilities P1,P2,...,Pm, the uncertainty function is given by:

                H(P1,P2,...,Pm) = -a * SUM_i Pi*log(Pi)

           where a is an arbitrary positive number, and the logarithm base
           is greater than one. This function is known in information
           theory as the entropy of an information source. In the
           following examples, we use the value of 1 for a
           and a logarithmic base equal to 10.
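A direct transcription of this definition might look as follows (a minimal sketch; the usual convention 0*log(0) = 0 is assumed for zero probabilities, and the base argument selects the logarithm base):

```python
import math

def entropy(probs, a=1.0, base=10):
    """H(P1,...,Pm) = -a * SUM_i Pi*log(Pi); zero probabilities
    contribute nothing, by the convention 0*log(0) = 0."""
    return -a * sum(p * math.log(p, base) for p in probs if p > 0)

# Requirement 1: more equiprobable choices means more uncertainty.
print(entropy([0.5, 0.5], base=2))                # 1.0
print(entropy([0.25, 0.25, 0.25, 0.25], base=2))  # 2.0
```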
          
                The entropy of an event measures our uncertainty about the
           possible outcome of the event. This measure can be extended to tell
           us how much information is gained about an event Y if we are told
           that a second event X has occurred. If Xi is the ith possible outcome
           of X, Yj the jth possible outcome of Y, P(Xi,Yj) the joint
           probability, and P(Yj/Xi) the corresponding conditional probability,
           then the conditional uncertainty of Y given X is defined as follows:

                H(Y/X) = -SUM_{i,j} P(Xi,Yj)*log[P(Yj/Xi)]
          
                It is always the case that H(Y/X) is less than or equal
           to H(Y). If X and Y are independent, then H(Y)=H(Y/X).
          In general, the difference between the uncertainties,
          H(Y)-H(Y/X), is a measure of the amount of information contained in X
          about Y or viewed another way, how much influence the outcome of 
          event X has on event Y.
          
          
          6.2.3 Using entropy to optimize the classification tree
          
               Applying the concept of entropy to the decision tree optimization
          problem, we find that if an object can be classified into one of 
          several different groups, the entropy is a measure of the 
          uncertainty of the classification of that object. 
           Mathematically, if an object can be classified into N classes,
           C1,...,CN, and the probability of the object being in class Ci
           is P(Ci), then the entropy of classification C is

                H(C) = -SUM_i P(Ci)*log(P(Ci))
          
          What is needed, in order to determine the attribute on which to 
          further subdivide a node, is the entropy (uncertainty) of 
          classification after an attribute is used. The decision tree 
          will then be extended using the attribute
          that results in the smallest entropy of classification.
          
             The first step in calculating the entropy of classification,
          H(C/A), for a particular attribute A, is to split the data table 
          into subtables in such a way that tuples in a subtable have the 
          same value of the partitioning attribute.
          
           The entropy of each subtable, H(C/aj), is calculated for each value
           aj of the attribute A. For each possible value aj of
           attribute A, H(C/aj) is given by the expression:

                H(C/aj) = -SUM_i P(Ci/aj)*log[P(Ci/aj) + n]

                The quantity P(Ci/aj) is the probability that the class value is
           Ci when the splitting attribute value is aj. The number n is a small
           adjustment factor inserted so that the logarithm is defined even when
           P(Ci/aj)=0 and log[P(Ci/aj)] is not defined.
           The overall entropy of classification, H(C/A), is then given by the
           weighted average of the entropies for the m possible values aj of the
           attribute A. Mathematically this is expressed as

                H(C/A) = SUM_j P(aj)*H(C/aj)
          
          We now illustrate the use of entropy for constructing decision trees.
          Consider the loan approval database. If the attribute "debt" is used
          to subdivide the patterns into two sets, one for each domain value of
          "debt", we get:
          
          
           LOAN     INCOME     HOME-OWNER     DEBT
           ---------------------------------------
           no       low        no             high
           no       low        yes            high
           no       average    yes            high
           no       average    no             high
           no       high       no             high
           yes      high       yes            high
           ---------------------------------------
           yes      average    yes            low
           no       high       no             low
           no       low        no             low
           no       average    no             low
          
          
          
          
          
          
          We can now calculate the entropy of each subtable:
          
          H(loan/debt=high)=
          -[P(loan=yes/debt=high)log(P(loan=yes/debt=high))
          +P(loan=no/debt=high)log(P(loan=no/debt=high))]=
          -[(1/6)log(1/6)+(5/6)log(5/6)] = 0.196
          
          H(loan/debt=low)=
          -[P(loan=yes/debt=low)log(P(loan=yes/debt=low))
          +P(loan=no/debt=low)log(P(loan=no/debt=low))]=
          -[(1/4)log(1/4)+(3/4)log(3/4)]=0.244
          
                In order to find the entropy of the entire table after the split,
           H(loan/debt), we take the entropy for each value of the attribute, multiply
           it by the probability of that value appearing in the table, and sum the
           products.
          
          H(loan/debt)=
          P(debt=high)*H(loan/debt=high)+P(debt=low)*H(loan/debt=low)=
          (6/10)*0.196 +(4/10)*0.244= 0.215 .
          
               If we perform these same calculations for the other attributes in our
          example, we find that:
          
          H(loan/income)=0.180, and
          H(loan/home_ownership)=0.120.
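These figures can be checked mechanically. The sketch below (base-10 logarithms, matching the worked figures above; all names are ours) computes the conditional entropy H(loan/A) for each candidate attribute:

```python
import math
from collections import Counter, defaultdict

# (loan, income, home_owner, debt) tuples from the sample database.
DATA = [
    ("no", "low", "no", "high"), ("no", "low", "no", "low"),
    ("no", "low", "yes", "high"), ("no", "average", "yes", "high"),
    ("no", "average", "no", "low"), ("no", "average", "no", "high"),
    ("yes", "average", "yes", "low"), ("no", "high", "no", "low"),
    ("no", "high", "no", "high"), ("yes", "high", "yes", "high"),
]
COLS = {"loan": 0, "income": 1, "home_owner": 2, "debt": 3}

def H(rows, goal="loan"):
    """Entropy of the goal attribute over a (sub)table, base-10 logs."""
    counts = Counter(r[COLS[goal]] for r in rows)
    return -sum(c/len(rows) * math.log10(c/len(rows))
                for c in counts.values())

def H_given(attr, rows=DATA):
    """H(C/A): the weighted average of the subtable entropies."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[COLS[attr]]].append(r)
    return sum(len(g)/len(rows) * H(g) for g in groups.values())

for a in ("debt", "income", "home_owner"):
    print(a, round(H_given(a), 3))
```

"home_owner" yields the smallest conditional entropy, confirming it as the best initial split; the last printed digit may differ from the figures above, which truncate rather than round.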
          
           Since H(loan/home_ownership) yields the smallest entropy, and thus the least
           uncertainty, "home_ownership" is the best attribute to select for the initial
           split. Indeed, by using this attribute we immediately obtain the very short rule
          
          IF (home_ownership=no) THEN (loan=no). 
          
          At each node, this decision process is repeated, in order to determine which
          attribute is to be chosen for further expansion. The reader can verify that
          "income" should be selected next, followed by "debt", and that the derived
          classification tree corresponds to the rules:
          
           IF (home_ownership=no) THEN (loan=no).
           IF (home_ownership=yes and income=low) THEN (loan=no).
           IF (home_ownership=yes and income=high) THEN (loan=yes).
           IF (home_ownership=yes and income=average and debt=low) THEN (loan=yes).
           IF (home_ownership=yes and income=average and debt=high) THEN (loan=no).
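The whole procedure, from entropy-based attribute selection down to rule extraction, fits in a short recursive sketch (a minimal illustration, not Quinlan's original code; all names are ours, and ties in the entropy comparison are broken arbitrarily):

```python
import math
from collections import Counter, defaultdict

# (loan, income, home_owner, debt) tuples from the loan approval database.
DATA = [
    ("no", "low", "no", "high"), ("no", "low", "no", "low"),
    ("no", "low", "yes", "high"), ("no", "average", "yes", "high"),
    ("no", "average", "no", "low"), ("no", "average", "no", "high"),
    ("yes", "average", "yes", "low"), ("no", "high", "no", "low"),
    ("no", "high", "no", "high"), ("yes", "high", "yes", "high"),
]
ATTRS = {"income": 1, "home_owner": 2, "debt": 3}
GOAL = 0  # column index of the goal attribute "loan"

def entropy(rows, base=10):
    counts = Counter(r[GOAL] for r in rows)
    return -sum(c/len(rows) * math.log(c/len(rows), base)
                for c in counts.values())

def cond_entropy(rows, attr):
    """H(C/A): weighted average of the subtable entropies."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[ATTRS[attr]]].append(r)
    return sum(len(g)/len(rows) * entropy(g) for g in groups.values())

def id3(rows, attrs, conds=()):
    """Expand a node; return the rules for the subtree below it."""
    if len({r[GOAL] for r in rows}) == 1 or not attrs:
        label = Counter(r[GOAL] for r in rows).most_common(1)[0][0]
        cond = " and ".join(f"{a}={v}" for a, v in conds)
        return [f"IF ({cond}) THEN loan={label}"]
    best = min(attrs, key=lambda a: cond_entropy(rows, a))
    groups = defaultdict(list)
    for r in rows:
        groups[r[ATTRS[best]]].append(r)
    rules = []
    for val, g in sorted(groups.items()):
        rules += id3(g, [a for a in attrs if a != best],
                     conds + ((best, val),))
    return rules

for rule in id3(DATA, list(ATTRS)):
    print(rule)
```

On this data set the sketch picks "home_owner" first, then "income", then "debt", and prints the five rules given above.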
          
          
          6.3 LEARNING FROM NOISY DATA
          
           In many applications, the learning data set is imperfect. Common problems
           include errors in attribute or class values, and missing values. In such
           cases, we say that the data is noisy. Inducing decision trees from noisy data,
          using the ID3 tree induction algorithm, gives rise to two problems: first,
          induced trees may unreliably classify new objects, and second, induced trees
          tend to be large and thus hard to understand. Noise makes the learning task
          more difficult and in order to effectively learn in a noisy domain, the
          requirement that learned concept descriptions match all the positive examples
          and no negative ones, is usually relaxed. The learned description is permitted
          to misclassify some of the learning objects. As an example, consider a
          situation in which we are to extend a node of a decision tree and let S be the
          current subset of objects in the node. Suppose that there are 100 objects in S,
          99 of which belong to class C1 and one of which belongs to class C2. Knowing
          that there is noise in the learning data set, and that all of the objects in S agree
          on the values of the attributes already selected up to this point, it seems
          plausible that the object in class C2, is in S due to an error in the data. If so, it
          is best to ignore this object, terminate expansion of that node in the decision
          tree and make it a leaf labeled with class C1. Since the original ID3 algorithm
          would in this situation further expand the decision tree, by stopping at this
          point we have pruned a subtree of the complete decision tree. Tree pruning is
          the key to coping with noise in tree induction algorithms. An algorithm may
          effectively prune decision trees by using some criterion that indicates whether
          to stop expanding the tree or not. The stopping criterion would typically take
          into account the number of examples in the node, the prevalence of the
          majority class at the node, etc. This kind of pruning, accomplished through
          stopping the tree expansion if certain criteria are met, is called
          construction-time pruning (pre-pruning). In an alternate method of pruning,
          called post-pruning, the whole decision tree is constructed first and then
          pruned, starting at the leaves and moving upwards.
          
          
          6.3.1 Construction time pruning
          
          Assume that the generated decision tree has become very "deep" (i.e., there
          are some long paths from the root to the leaves) and that most nodes, far away
          from the root, contain instances with the vast majority sharing the same value
          of the goal attribute, except maybe a few. This phenomenon indicates that
          those nodes should contain instances of a single class, and the small
          percentage of instances in the other class is probably due to noise. These
          nodes are created in the decision tree because the induction algorithm tries to
          overfit the training set. A good strategy is to stop branching when this
          situation occurs.
          
               As an example of a criterion, which can be used to determine when a
          node should be pruned, consider the depth pruning ratio. The depth pruning
          ratio is defined as follows:
          
           R(a,d) = a^d

           where the depth pruning threshold, a, is a constant ranging between 0 and 1,
           and d is the depth of the node under consideration for expansion (i.e., the
           number of branches from the tree root to the current node). During the
           induction of a decision tree, the percentages of positive instances and
           negative instances are compared with the depth pruning ratio at each node.
           When either of them is greater than or equal to the depth pruning ratio,
           the node is labeled as a leaf without further branching.
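Read as R(a,d) = a^d, the criterion is a one-liner. The sketch below assumes a two-class problem where majority_fraction is the larger of the two class percentages; the threshold values in the examples are illustrative:

```python
def should_prune(majority_fraction, depth, a=0.9):
    """Label the node a leaf when the majority-class fraction meets
    the depth pruning ratio R(a, d) = a**d, where a is the depth
    pruning threshold (0 <= a <= 1) and d is the node's depth."""
    return majority_fraction >= a ** depth

print(should_prune(0.95, 1, a=0.97))  # False: keep splitting near the root
print(should_prune(0.95, 5, a=0.97))  # True: prune deep in the tree
```

Note the two extremes from the text: a=1 gives R=1 at every depth (only pure nodes become leaves), while a=0 prunes everything immediately.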
          
                Based on the definition of the pruning ratio, nodes containing nearly
           100 percent instances of one class may still need to be partitioned further
           when they are close to the root. This prevents excessive pruning and thus
           preserves accuracy, which is warranted by the small depth of these nodes.
           On the other hand, if such nodes are far away from the root, branches below
           them are probably unnecessary. The depth pruning ratio decreases as the
           depth of the node increases; eventually, the percentage of positive (or
           negative) instances exceeds it, and these nodes are labeled as leaves.
          
                It is crucial to have a good depth pruning threshold. A bad threshold
           leads either to over-pessimistic pruning (e.g., a=1) or to excessive pruning
           (e.g., a=0), whereas a good one both prunes the noisy branches effectively
           and avoids excessive pruning. Experimentation with different threshold
           choices is needed, until a satisfactory decision tree is derived.

           For examples of other existing pruning techniques, such as retrospective
           pruning and the chi-squared method, the reader is referred to the literature.
          
          
          6.3.2 Post Pruning
          
          We present here a post pruning algorithm, which was introduced in the
          Assistant 86 learning system, ([4]). The key decision in post-pruning is
          whether to prune a subtree or not. This decision is based on classification
          error estimates of the decision subtree. 
          
               Let S be the set of objects classified into the root of the subtree
          considered for pruning, and K be the number of classes (size of the domain of
          the goal attribute). Suppose that there are N instances in S, the majority class
          in S is C, n out of N instances in S belong to C, and the rest of the instances in
          S belong to other classes. If we prune the subtrees of this node and label the
          node with its majority class C, the result will be the introduction of a
          classification error. Under the assumption that all classes are equiprobable
          and that the prior distribution of the probabilities of the classes is uniform, the
          expected probability of the classification error at that node E(S) for set S is:
          
          E(S)= (N-n+K-1)/(N+K) 
          
               Such an error estimator can be directly applied to estimate the
          classification error at each leaf of a decision tree: Error(leaf)=E(S), where S
          is the set of instances at the leaf. This error estimator is easily generalized to
          estimate classification errors of non-leaf nodes in a decision tree as follows.
          
                Let Node be a non-leaf node of a decision tree with children Node1,
           Node2, etc., and let Pi be the probability of the branch from Node to Nodei.
           The probabilities Pi can be approximated by the relative frequencies of the
           sets of instances contained in the nodes Nodei. Then we define:

           BackedUpError(Node) = SUM_i Pi*Error(Nodei)
          
          We can now compare the node's "static" error with its backed-up error. In the
          case that the static error is lower, then we prune the node's subtree. 
          
               This formula suggests the following pruning algorithm to optimize the
          estimated classification error. Given a complete (unpruned) decision tree, the
          algorithm starts at the bottom of the tree and progresses toward the root of the
          tree, backing up the error estimates and pruning inferior subtrees. This
          algorithm guarantees that all of the subtrees are pruned optimally with respect
          to the classification error estimates. As a result, it will minimize the expected
          error at the root of the tree; that is, the expected error of the whole tree. Of
          course, the pruned tree obtained this way is only optimal with respect to error
          estimates.
          
                Figure 6.1 shows the pruning of a decision tree using the procedure
           described above. In the figure, the left-most leaf of the unpruned tree has
           class frequency [3,2], meaning that three objects of class C1 and two
           objects of class C2 are contained in this leaf. The static error for these
           class frequencies is:

           E = (N-n+K-1)/(N+K) = (5-3+2-1)/(5+2) = 0.429

           Similarly, for the left-hand child of node b the error estimate is 0.4,
           and for the right child the error is 0.2. For node b, the static error
           estimate is:

           E(b) = (11-9+2-1)/(11+2) = 0.230

           The backed-up error estimate for b is:

           BackedUpError(b) = (3/11)*(0.4) + (8/11)*(0.2) = 0.254

           Since the static error of node b is smaller than the backed-up error, the
           subtrees of node b are pruned, making b a leaf. The process is then repeated
           for node a.
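The arithmetic of this example can be reproduced directly; the sketch below uses the class frequencies quoted for Figure 6.1 (function names are ours):

```python
def static_error(N, n, K):
    """E(S) = (N - n + K - 1) / (N + K): expected classification error
    when a node with N instances, n of them in the majority class,
    is made a leaf (K = number of classes)."""
    return (N - n + K - 1) / (N + K)

def backed_up_error(children):
    """Weighted sum of child errors; children is a list of
    (branch probability, error estimate) pairs."""
    return sum(p * e for p, e in children)

print(static_error(5, 3, 2))  # left-most leaf, [3,2]: 3/7

# Node b: 11 instances, 9 in the majority class, two classes.
e_static = static_error(11, 9, 2)                       # 3/13
e_backup = backed_up_error([(3/11, 0.4), (8/11, 0.2)])  # 2.8/11
print("prune b:", e_static < e_backup)  # True: make b a leaf
```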
          
          
          6.4 VERSION SPACE SEARCH AND CONCEPTUAL CLUSTERING
          
          We present an algorithm, ([21]), which is a combination of the candidate
          elimination algorithm, ([17]), and the conceptual clustering technique ([16]). 
          
               Assume that each training example is expressed as a vector of
          attribute-value pairs. The attribute values can be nominal or categorical in
          nature. 
          
               An object is defined as an instance vector of attribute-value pairs. For
          example, O=(age=new, competition=no, product=tv, profit=up) is an object.
          
               A concept is a generalization of an object and is described by the
          same description language as objects, but attributes can assume a set of values
          instead of just a single value. For example, C=(age={new,old},
          competition=no, product={tv,vcr}, profit=up) is a concept, which contains the
          object O as an instance. Each concept corresponds to a rule. For example,
          given that the goal attribute is the profit, concept C can be stated as: If a
          company is old or new, and there is  no competition, and the product is TVs
          or VCRs, then the profit is up. 
          
               A concept A is more general than a concept B or A covers B (A>B)
          iff each object which belongs to the concept B also belongs to the concept A.
          The relation "more general than" defines a partial order on the set of all
          possible concepts defined on an attribute set. The search techniques of AI can
          be used to search the lattice of concepts (the version space) in order to find a
          most general, complete and consistent concept covering the set of training
          examples. 
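               The coverage test and the "more general than" ordering can be
          sketched directly, representing a concept as a mapping from each attribute
          to its set of allowed values (a sketch under that assumption; the goal
          attribute is omitted for brevity):

```python
def covers(concept, obj):
    """A concept covers an object iff, for every attribute, the object's
    value is among the values the concept allows for that attribute."""
    return all(obj[a] in vals for a, vals in concept.items())

def more_general(a, b):
    """A is more general than B (A > B) iff every value B allows for an
    attribute, A also allows -- so every object of B is an object of A."""
    return all(b[attr] <= a[attr] for attr in a)  # set inclusion per attribute

# Concept C and object O from the text (goal attribute omitted).
C = {"age": {"new", "old"}, "competition": {"no"}, "product": {"tv", "vcr"}}
O = {"age": "new", "competition": "no", "product": "tv"}
print(covers(C, O))  # True
```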
          
               A concept A is complete with respect to a set of training examples iff
          it covers all the positive examples in the set.
          
               A concept A is consistent with respect to a set of training examples
          iff it contains none of the negative examples.
          
               A concept G is a maximally specific generalization of a concept A iff
          G is more general than A and there is no concept G' such that G>G'>A.
          Consider the concept A=(age=midlife, competition=no, product_type=VCR).
          Maximally specific generalizations can be obtained by assigning to one of the
          attributes one more value than the ones it already has. For example, the
          concept G=(age={midlife,new}, competition=no, product_type=VCR) is one
          of the maximally specific generalizations of A, since no concept strictly
          less general than G is still strictly more general than A.
          
               The algorithm given below is a variation of the candidate elimination
          algorithm ([17], [19], [20]). It is a bottom up algorithm, in contrast to ID3.
          ID3 starts with the whole set of examples, or equivalently with the most
          general concept (the one in which every attribute may assume any value in
          its domain), and proceeds by specializing. The algorithm given below starts
          with positive instances (the most specific concepts) and gradually
          generalizes them, so that they remain consistent. Heuristics are used to guide
          the search through the potentially prohibitive size of the version space. 
          
          Algorithm:
          
          Initialize the concept set (output) to empty.
          REPEAT
               Get a positive example.
               IF the example is covered by any of the concepts in the current
                    concept set THEN go to LOOP
               ELSE
               begin
                    The example becomes the maximally specific concept at the
                         root level of a tree.
                    Current_level = 1
                    REPEAT
                         For each concept node in Current_level, find all its
                              maximally specific generalization concepts, which
                              become its children in the tree, at level
                              Current_level = Current_level + 1.
                         Delete from Current_level the concepts that are not
                              consistent.
                    UNTIL all nodes are leaves
                    Insert all concepts corresponding to tree leaves in the
                         concept set.
               end
          LOOP UNTIL no more positive examples.
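               The inner loop of the algorithm (growing one positive example into
          its most general consistent concepts) can be sketched as follows. This is
          an exhaustive sketch without the beam heuristic discussed next, assuming
          attribute domains are given as sets; the company data used later in this
          section serves as the test case:

```python
def covers(concept, obj):
    """concept: attribute -> set of allowed values; obj: attribute -> value."""
    return all(obj[a] in concept[a] for a in concept)

def generalizations(concept, domains):
    """Maximally specific generalizations: add one new value to one attribute."""
    for a in concept:
        for v in domains[a] - concept[a]:
            g = dict(concept)
            g[a] = concept[a] | {v}
            yield g

def _key(c):
    # Hashable signature of a concept, for per-level deduplication.
    return tuple(sorted((a, tuple(sorted(vs))) for a, vs in c.items()))

def generalize_example(pos, negatives, domains):
    """Grow one positive example bottom-up into the leaves of its concept
    tree: concepts that cover no negative example and cannot be
    generalized further without doing so."""
    level = [{a: {v} for a, v in pos.items()}]
    leaves, seen = [], set()
    while level:
        nxt = {}
        for node in level:
            kids = [g for g in generalizations(node, domains)
                    if not any(covers(g, n) for n in negatives)]
            if kids:
                for g in kids:
                    nxt[_key(g)] = g          # deduplicate within the level
            elif _key(node) not in seen:
                seen.add(_key(node))
                leaves.append(node)           # cannot generalize further
        level = list(nxt.values())
    return leaves

# The company data from the table in this section; the negatives are the
# profit=down rows, the positive is the first profit=up row.
domains = {"age": {"old", "midlife", "new"},
           "competition": {"yes", "no"},
           "product": {"software", "hardware"}}
negatives = [{"age": "old", "competition": "no", "product": "software"},
             {"age": "midlife", "competition": "yes", "product": "software"},
             {"age": "old", "competition": "no", "product": "hardware"},
             {"age": "midlife", "competition": "yes", "product": "hardware"},
             {"age": "old", "competition": "yes", "product": "software"}]
pos = {"age": "midlife", "competition": "no", "product": "hardware"}
for leaf in generalize_example(pos, negatives, domains):
    print(leaf)
```

          Run on the company data, this produces the single concept
          ({midlife,new}, no, {hardware,software}) derived by hand in the text.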
          
          
               If m is the number of attributes, then the size of a concept tree can be
          as big as 1 + m + m(m-1) + m(m-1)(m-2) + ... + m(m-1)...2·1 ≈ e·m!. To
          avoid this combinatorial explosion, heuristics can be used to guide the
          expansion of the trees. A fitness measure can be used to evaluate the potential
          usefulness of a node, and only the best K nodes of a level are expanded each
          time. Let A be the attribute which gives the smallest uncertainty (entropy) of
          classification about the value of the goal attribute. The tree nodes expanded
          by the algorithm are then the ones derived from their parent by extending
          attribute A. 
          
               Consider for example, the set of object vectors (from [26]), where the
          goal attribute is profit=up:
          
          
          PROFIT     AGE        COMPETITION     PRODUCT_TYPE
          down       old        no              software
          down       midlife    yes             software
          up         midlife    no              hardware
          down       old        no              hardware
          up         new        no              hardware
          up         new        no              software
          up         midlife    no              software
          up         new        yes             software
          down       midlife    yes             hardware
          down       old        yes             software
          
          
          The algorithm would expand the first positive instance as follows:
          
                              (midlife, no, hardware)
                                   |
                                   |--- (midlife, no, {hardware,software})
                                   |--- (midlife, {yes,no}, hardware)
                                   |--- ({midlife,old}, no, hardware)
                                   |--- ({midlife,new}, no, hardware)
          
               The second concept at level 2 of the tree covers the negative instance
          (down,midlife,yes,hardware) and the third covers the negative instance
          (down,old,no,hardware), so they are eliminated from further consideration.
          The first node is expanded by its maximally specific generalizations as
          follows (the fourth node, which remains consistent, generalizes to the same
          final concept):
          
          
          
                              (midlife, no, hardware)
                                   |
                         (midlife, no, {hardware,software})
                                   |
                                   |--- (midlife, {yes,no}, {hardware,software})
                                   |--- ({midlife,new}, no, {hardware,software})
                                   |--- ({midlife,old}, no, {hardware,software})
          
               The first and third of the leaves are inconsistent (they cover negative
          instances), so the concept set after the first positive example is the concept
          ({midlife,new},no, {hardware,software}). This concept corresponds to the
          rule: IF age=(midlife or new) and competition=no THEN profit=up.
          
               Repeating the process for each positive example, will also yield the
          rule: IF age=new THEN profit=up.
          
               As another example, consider the following set of object vectors,
          where the type of airplane is the goal attribute.
          
          
          PLANE_TYPE   ENGINE_TYPE   WING_POSITION   WING_SHAPE   TAIL      BULGES
          C130         prop          high            regular      regular   under_wings
          C141         jet           high            swept        T-tail    aft_wings
          C5A          jet           high            swept        T-tail    none
          C747         jet           low             swept        regular   aft_cockpit
          
          
          
          With a branching factor of K=1, and the entropy used as a heuristic measure,
          the algorithm will derive the same rule set as ID3, namely:
          
          IF bulges=under_wings THEN airplane=C130
          IF bulges=aft_wings THEN airplane=C141
          IF bulges=none THEN airplane=C5A
          IF bulges=aft_cockpit THEN airplane=C747
          
          For K=2, three more rules will be derived in addition to the rules above:
          
          IF wing_shape=regular THEN airplane=C130
          IF engine=prop THEN airplane=C130
          IF wing_position=low THEN airplane=C747
          
               Finally we give an example where ID3 fails, but the conceptual
          clustering algorithm succeeds. Consider the following database containing
          contradictions:
          
          
          
          DISEASE   EYES        TEMPERATURE   NOSE
          none      clear       normal        clear
          cold      bloodshot   high          runny
          flu       bloodshot   high          runny
          
          
          
          ID3 fails, while the algorithm described above gives the following set of rules:
          
          IF nose=clear THEN disease=none (for branching factor K=1)
          
          or
          
          IF nose=clear THEN disease=none and
          IF eyes=clear THEN disease=none (for branching factor K=2).
          
          
          6.5 CASE BASED REASONING
          
          Case-based reasoning (CBR) is a form of analogical reasoning which has
          enjoyed increased popularity as a developmental methodology due to its
          avoidance of the knowledge acquisition bottleneck and the natural way in
          which it integrates with object oriented environments. In case-based
          reasoning, the system recalls previous situations similar to the current one
          and adapts them to help solve the current problem. The existing descriptions
          of previously encountered problems are known as cases. Cases are usually
          implemented as objects in an object oriented environment even though other
          knowledge representation schemes (for example attribute-value pairs, rules
          or conceptual graphs) have also been used depending on the domain
          characteristics.
          
               The development of a CBR system involves the collection of a
          depository of cases, deciding which features are important, and choosing an
          appropriate case representation scheme. The CBR system consists of the case
          database, and subsystems for case indexing, case retrieval, and case
          adaptation.
          
               When a new case is presented, the case retrieval subsystem searches
          the indexed database of cases for the most similar case. The adaptation
          subsystem uses the retrieved case as a model in order to create a solution for
          the new problem, to warn the user of possible failures that have occurred in
          the past, and to interpret the current situation. Two types of adaptation are
          possible. In structural adaptation the solution of the retrieved case is
          modified to fit the new case, while in derivational adaptation a new solution
          is created by using the same process as used in the retrieved case. If no case
          can be retrieved which is sufficiently similar to the current case, then the
          current case together with a solution supplied by a human expert is stored and
          indexed by the case indexing subsystem into the database of cases, thus
          enabling the system to learn.
          
               Several techniques are available for indexing a database of cases
          including conceptual clustering, inductive indexing, and nearest neighbor
          indexing. In inductive indexing an algorithm such as ID3 or Michalski's
          conceptual clustering is used to create a classification tree or a clustering
          which groups sets of similar cases together. When a new case is presented, the
          values of its attributes determine the path down the classification tree that
          should be followed or the cluster that contains the most similar cases to the
          new one. In the nearest neighbor indexing scheme, each feature of a case is
          assigned match and mismatch weights by the developer. The match weight
          indicates how important the presence of a feature is for judging two cases
          similar, while the mismatch weight indicates how strongly the absence of the
          feature from one of the compared cases counts against their similarity. The
          weights are combined by various similarity metrics to retrieve the most
          similar case.
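               A minimal sketch of such a weighted nearest-neighbor retrieval is
          given below. The additive metric, the weight values, and the disease-style
          cases are illustrative assumptions, not taken from the text:

```python
def similarity(case_a, case_b, match_w, mismatch_w):
    """Weighted similarity: a shared feature value adds its match weight,
    a differing (or absent) value subtracts its mismatch weight.
    (The additive combination is one of many possible metrics.)"""
    score = 0.0
    for f in match_w:
        if case_a.get(f) == case_b.get(f):
            score += match_w[f]
        else:
            score -= mismatch_w[f]
    return score

def retrieve(query, case_base, match_w, mismatch_w):
    """Return the stored case most similar to the query case."""
    return max(case_base, key=lambda c: similarity(query, c, match_w, mismatch_w))

# Hypothetical weights and cases for illustration.
match_w = {"eyes": 2.0, "nose": 3.0, "temperature": 1.0}
mismatch_w = {"eyes": 1.0, "nose": 1.0, "temperature": 1.0}
case_none = {"eyes": "clear", "nose": "clear", "temperature": "normal"}
case_cold = {"eyes": "bloodshot", "nose": "runny", "temperature": "high"}
query = {"eyes": "clear", "nose": "runny", "temperature": "normal"}
best = retrieve(query, [case_none, case_cold], match_w, mismatch_w)
```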
          
               Several case-based expert system shells or shells with case-based
          subsystems have become available including KwEshell, ART-IM, CBR
          Express, Esteem, ReMind (see Appendix C). Application areas ideally
          suitable for CBR system development include the legal domain, intelligent
          tutoring, and systems design. For a thorough introduction to Case-based
          reasoning, the reader is referred to the book by Kolodner, ([14]).
          
          
          6.6 EVOLUTIONARY MACHINE LEARNING
          
          The evolutionary approach to machine learning is based on the concepts of
          genetic algorithms and classifier systems.
          
               Genetic algorithms and classifier systems are optimization, search
          and machine learning techniques, suitable for a variety of problems, including
          NP-complete and state-space type problems. For a detailed introduction
          to the topic, the reader is referred to Davis, ([6]), Goldberg, ([11]) or Holland,
          ([13]).
          
          
          6.6.1 Genetic Algorithms
          
          Genetic Algorithms are based on the concepts of natural selection and
          genetics. They were introduced by Holland, ([13]), with the intention of
          designing a computer system with many of the characteristics of natural
          systems. 
          
               A genetic algorithm is a global search technique: rather than
          expanding the search for optima from a single point, it evolves a
          population of points towards the optimum solution. This feature makes
          genetic algorithms particularly suitable for parallel processing and also
          reduces the probability of getting stuck at local optima. 
          
               A genetic algorithm solves a problem as follows. An initial set of
          possible solutions is chosen at random and the solutions are encoded as finite
          strings of symbols. The encoded set of possible solutions is called a
          population. The genetic algorithm does not need extensive knowledge of the
          problem domain in order to perform an efficient search of the solution space.
          It only requires a function (the fitness function) to measure the fitness of a
          candidate solution. The closer an encoded string is to being a solution, the
          bigger its fitness value should be. The initial population is successively
          transformed to populations of greater and greater overall fitness until a
          convergence criterion is met. During evolution the population size usually
          remains constant. The basic genetic operators used to transform a population
          to a better one are: reproduction, crossover and mutation. 
          
               In reproduction, the individuals in the population are propagated to
          the next population with a probability proportional to their fitness, in such a
          way that individuals of greater fitness have a bigger chance of surviving in the
          next population. Many alternate ways have been proposed to achieve this,
          with the simplest but still quite effective being the original one proposed by
          Holland. In this approach, usually referred to as the roulette wheel approach, 
          the overall fitness of the population is calculated as the sum of the fitness of
          the individuals. The ratio of an individual's fitness to the overall fitness
          corresponds to a proportional slice of a roulette wheel. Which individuals
          make it to the next generation is determined by successive spins of the roulette
          wheel. For example, suppose the present population consists of the
          individuals 11000, 00111 and 10101 with corresponding fitness 30, 60 and
          10. Then the spinning wheel is divided into three subareas, subarea 1, 2 and
          3, corresponding to 30%, 60% and 10% of its total area. The wheel is spun
          three times to determine the three individuals in the next generation. If the
          wheel lands in area 1 then the bit string 11000 is reproduced, if it lands in area
          2 then the string 00111 is reproduced, and if it lands in area 3 then 10101 is
          reproduced.
          Clearly the string 00111 is twice as likely as string 11000 and six times as
          likely as string 10101 to reproduce.
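               The roulette wheel can be sketched as follows, using the population
          and fitness values of the example above:

```python
import random

def roulette_select(population, fitness, rng=random):
    """Fitness-proportionate (roulette wheel) selection: each individual
    owns a slice of the wheel proportional to its share of total fitness."""
    total = sum(fitness)
    spin = rng.uniform(0, total)
    cumulative = 0.0
    for individual, f in zip(population, fitness):
        cumulative += f
        if spin <= cumulative:
            return individual
    return population[-1]          # guard against floating-point round-off

# The example from the text: fitness 30, 60, 10 -> slices of 30%, 60%, 10%.
population = ["11000", "00111", "10101"]
fitness = [30, 60, 10]
next_generation = [roulette_select(population, fitness) for _ in range(3)]
```

          Over many spins, 00111 is selected about twice as often as 11000 and six
          times as often as 10101, as the text states.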
          
               Crossover introduces variety in the population of candidate solutions
          by combining two existing individuals (parents) into new individuals
          (offspring). Many forms of crossover are possible, and in its simplest form
          crossover works as follows. If k is the length of the two strings, randomly
          select a position p between 1 and k. The new strings are produced by merging
          positions 1 to p of parent 1 with positions p+1 to k of parent 2, and positions
          1 to p of the parent 2 with positions p+1 to k of parent 1. For example,
          consider the individuals 110101 and 010011. If crossover is applied at
          position 3, the resulting offspring are 110011 and 010101.
          
               Mutation introduces variation to the population by randomly altering
          individuals. This can lead the algorithm out of stagnation at local optima. One
          way to implement mutation is to randomly pick an individual and a position p
          between 1 and k, and change the value at position p to another domain value
          chosen at random. 
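               The two operators can be sketched together; the crossover call below
          reproduces the example from the text:

```python
import random

def crossover(parent1, parent2, p):
    """One-point crossover at position p (1 <= p < len): swap the tails."""
    return (parent1[:p] + parent2[p:], parent2[:p] + parent1[p:])

def mutate(individual, rng=random, alphabet="01"):
    """Change one randomly chosen position to another value of the domain."""
    p = rng.randrange(len(individual))
    choices = [c for c in alphabet if c != individual[p]]
    return individual[:p] + rng.choice(choices) + individual[p + 1:]

print(crossover("110101", "010011", 3))  # ('110011', '010101')
```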
          
               Genetic algorithms can be analyzed by looking at how schemata
          behave during the evolution of the population from generation to generation.
          A schema is a set of encoded solutions. For example, the schema 11**
          represents all strings of length four, with the first two bits equal to 1. The
          Fundamental Theorem of Genetic Algorithms shows that short, highly fit
          schemata propagate themselves through evolution in a manner of exponential
          growth and converge to a solution ([11]). 
          
               Solving a problem using genetic algorithms, involves finding a way
          to encode the candidate solutions as finite strings, finding an appropriate
          fitness function to evaluate the goodness of a candidate solution and creating
          appropriate reproduction, crossover and mutation operators for the given
          problem. This process is illustrated by the following example.
          
               Consider the boolean satisfiability problem (SAT). Given a boolean
          expression of n variables, is there an interpretation (assignment to the
          variables) that will make the formula true? This is an NP-complete problem
          (no polynomial-time algorithm is known for it; as the number of variables
          increases, the worst-case effort of known algorithms grows exponentially).
          There is an obvious way to encode
          the candidate solutions as strings. In fact they are binary strings of fixed
          length. It takes more thought to find an appropriate fitness function. The
          obvious choice of assigning a fitness of 1 to a string that satisfies the boolean
          expression and 0 to one which does not, will not work. The majority of the
          population elements will have fitness equal to 0. Many variations of fitness
          functions for the SAT have been proposed in the literature, with one of them
          being the following. To evaluate the fitness of a binary string as a candidate
          interpretation to a boolean function, use the following formulas:
          
          fitness(x) = 1 if the bit assigned to x is 1, and 0 if it is 0
          fitness(¬x) = 1 - fitness(x)
          fitness(x1 ∨ ... ∨ xn) = max(fitness(x1), ..., fitness(xn))
          fitness(x1 ∧ ... ∧ xn) = (1/n) Σi fitness(xi)
          
               For example, consider the function
          b(x1,x2,x3,x4) = (x1 ∨ x2) ∧ (x3 ∨ ¬x4) ∧ ¬x1. In order to find a
          substitution which satisfies the formula, the genetic algorithm proceeds as
          follows: 
          
               Randomly pick a population of binary strings of length 4, say: 1011,
          1001, 0010. For each string evaluate its fitness:
          
          fitness(1011) = (1/3)[max(fitness(x1),fitness(x2)) + max(fitness(x3),
               1-fitness(x4)) + (1-fitness(x1))]
                        = (1/3)[max(1,0) + max(1,1-1) + (1-1)]
                        = (1+1+0)/3 = 2/3
          fitness(1001) = 1/3
          fitness(0010) = 2/3
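               The fitness computation can be sketched with the formula b
          hardcoded, applying max over each clause and averaging the clauses:

```python
def bit_fitness(bits, i, negated=False):
    """Fitness of the literal x_i (or its negation) under interpretation bits."""
    f = float(bits[i - 1] == "1")
    return 1.0 - f if negated else f

def sat_fitness(bits):
    """Fitness of an interpretation of b = (x1 v x2) ^ (x3 v ~x4) ^ ~x1:
    each clause scores the max over its literals, and the conjunction
    averages the clause scores."""
    clauses = [
        max(bit_fitness(bits, 1), bit_fitness(bits, 2)),
        max(bit_fitness(bits, 3), bit_fitness(bits, 4, negated=True)),
        bit_fitness(bits, 1, negated=True),
    ]
    return sum(clauses) / len(clauses)

for s in ["1011", "1001", "0010"]:
    print(s, sat_fitness(s))
```

          Only a satisfying interpretation, such as 0110, reaches the maximum
          fitness of 1.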
          
               The roulette wheel will be separated into three subsections of area
          40%, 20% and 40% of the total area. During reproduction, the first and third
          strings will have double the chance to reproduce than the second string.
          Strings will randomly be chosen to be crossed over at a random position. For
          example, if string 1011 and 0010 are crossed over at position 2, they will
          produce the offspring 1010 and 0011. Mutation here is obviously necessary,
          as any eventual solution must have the second bit equal to 1, and this cannot
          happen from the initial population just by reproduction and crossover. As
          generation after generation of candidate solutions are produced, somewhere
          down the line, the random choice of which bit to mutate will be for position 2,
          introducing in the population the desired 1 in position 2. Parameters such as
          the population size and mutation rate (how often to apply mutation) affect the
          speed of convergence of the GA. 
          
          
          6.6.2 Classifier Systems
          
          Classifier systems have certain similarities with production systems as they
          are also based on match-select-act cycles. But unlike production systems,
          which typically select to fire one rule at a time, classifier systems can activate
          multiple rules in parallel. They also differ from production systems in that
          they employ competitive arbitration strategies to select which rules are to be
          fired and a reward based system for evaluating the strength of individual rules.
          The rule base of a classifier system continually evolves, through the use of
          genetic algorithms which evolve weak rules to rules of greater strength. 
          
               A classifier system is a machine learning system which consists of
          four main modules: a classifier pool, a message list, an apportionment of
          credit algorithm, and a Genetic Algorithm, ([11]).
          
               The classifier pool is a set of production rules (classifiers) written
          in the form:
          
          condition:action, strength
          
          which represents the rule:
          
          IF (condition) THEN (action): strength.
          
               A single condition or a conjunction of conditions can be present in the
          classifier. The strength assigned to a rule is indicative of its usefulness to the
          system. The condition and action part of the rule are usually encoded as strings
          of a finite symbol set, for example {1,0,#}. Both conditions and actions are
          encoded as strings of the same length k. The # symbol corresponds to the
          "don't care" element. If it is present in a condition, then it can be matched by
          either a 1 or a 0 in the corresponding position of a message.  
          
               Messages (inputs) posted by the environment are written on a
          blackboard (message list). Classifiers, whose condition part matches a
          message on the blackboard, compete and the winners activate and post their
          action as a message on the blackboard or as output of the system. The process
          is repeated, producing a chain of classifier firings. When a classifier fires and
          it has the # symbol present in its action part, the symbol, in the corresponding
          position of the message that matched its condition, is assigned to the # symbol.
          For example, if the message list contains the message 1101011, the classifier
          11#10#1:000#1## can fire and post the message 0001111. 
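               The matching and firing step can be sketched as follows, reproducing
          the example just given:

```python
def matches(condition, message):
    """A condition matches a message when every non-# position agrees."""
    return all(c in ("#", m) for c, m in zip(condition, message))

def fire(classifier, message):
    """Post the classifier's action; a # in the action part copies the
    corresponding bit of the message that matched the condition."""
    condition, action = classifier.split(":")
    if not matches(condition, message):
        return None
    return "".join(m if a == "#" else a for a, m in zip(action, message))

print(fire("11#10#1:000#1##", "1101011"))  # 0001111
```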
          
               An apportionment of credit algorithm is used to decide which
          classifiers will fire and also update the strength of classifiers in proportion to
          their usefulness. An example of such a system is the Bucket Brigade
          Algorithm. Initially, in the absence of any information about the usefulness
          of the classifiers, all of the classifiers are assigned equal strength. Classifiers,
          which are eligible to fire by matching a posted message to their conditions, bid
          an amount proportional to their strength for the right to fire. The winning
          classifiers (the classifiers posting the highest bids) fire and post their
          actions as messages on the blackboard. The strength of a firing classifier is
          reduced by the bid amount. The bid amount is divided, among the classifiers
          which posted the messages enabling the winner to fire, and is added to their
          strengths. The following difference equation determines how the strength of
          a classifier is updated in time t+1 (after a firing cycle) in terms of its strength
          in time t:
          
          S(t+1)= S(t)-B*S(t)-T*S(t)+R(t)
          
          where B is the bid rate, 0 ≤ B ≤ 1 (if the classifier did not fire, then
          B=0), R(t) is the reward the classifier is due for posting any messages
          enabling others to fire, and T is the tax rate, 0 ≤ T ≤ 1. Each classifier
          pays to the environment an amount proportional to its strength, to avoid
          predominance of the rule population by the stronger rules. Usually T is
          chosen much smaller than B, and in such a way that 0 < T+B ≤ 1. It has been
          shown that the difference equation given above is stable, in the sense that,
          in the absence of rewards, S(t) approaches 0 as t approaches infinity. 
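               The strength update is a one-line difference equation; the short loop
          below (with illustrative rate values) shows the geometric decay toward 0
          when no rewards arrive:

```python
def update_strength(s, bid_rate, tax_rate, reward):
    """S(t+1) = S(t) - B*S(t) - T*S(t) + R(t); B is 0 for a classifier
    that did not fire in this cycle."""
    return s - bid_rate * s - tax_rate * s + reward

# With no rewards, strength decays geometrically: S(t+1) = (1 - B - T) S(t).
s = 100.0
for _ in range(50):
    s = update_strength(s, bid_rate=0.1, tax_rate=0.01, reward=0.0)
```

          With a steady reward R, the strength instead settles at the fixed point
          S = R/(B+T).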
          
               The quality of the rules in the classifier pool is improved by applying
          genetic algorithms to the population of classifiers, with the classifier strength
          used as their fitness value. The genetic algorithm can be called to improve the
          classifier pool, either probabilistically, with an average number of time steps
          between calls, or deterministically, with a fixed number of time steps between
          calls.
          
          
          6.6.3 Deriving or improving rule sets with genetic algorithms, BEAGLE
          
          One of the first systems, to use genetic algorithms for deriving rules from data,
          was the BEAGLE system (Biologic Evolutionary Algorithm Generating
          Logical Expressions, Forsyth, [8]). Given a set of training examples, of the
          form (attribute1=value1,..., attributen=valuen), where attributen is the goal
          attribute, we wish to derive rules specifying under what conditions on the
          other attributes, the goal attribute will assume a given value.
          
               For example, consider the company data of Section 6.2.1. Assume
          that the possible domain values for the attributes are:
          
          for company age:  1 (very old), 2 (old), 3 (midlife), 4 (new), 5 (very new)
          for product type:  1 (software), 2 (hardware).
          for competition:  1 (if there is competition), 0 (if there is no competition).
          for profit:  1(if profit is up), 2 (if profit is down).
          
          The objective is to derive rules of the form:
          
          IF conditions THEN profit=1.
          
               The Beagle algorithm randomly picks a population of rule conditions.
          For example:
          
          condition 1: (age<=3) 
          condition 2: (product type<2) ∧ (competition=1).
          
          The first condition corresponds to the rule:
          
          IF company is very old or old or midlife, then the profit is up.
          
          The second condition corresponds to the rule:
          
          IF product is software and there is competition, then the profit is up.
           
               Possible positions for crossover are the relational operators (<, >,
          etc.) and logical operators (¬, ∧, ∨). In the conditions above, if the positions
          of <= and < are randomly chosen for a crossover, the offspring conditions after
          crossover will be (age<=2) ∧ (competition=1) and (product type<3). These
          conditions correspond to the rules:
          
          IF company is old or very old and it faces competition then profit is up and 
          IF product is software or hardware THEN profit is up.
          
               If, on the other hand, the relational operator <= and the logical
          operator ∧ were picked as the crossover points, then the following two
          conditions would be produced after crossover: 
          
          age<=(competition=1)
          and 
          (product type<2) ∧ 3, which reduces to:  product type<2.
          
          Semantically, these conditions correspond to the rules:
          
          If the age is very old and competition is present then profit is up
          and
          If product type is software then profit is up. 
          
               The genetic algorithm of Beagle produces the next population of rules
          from the present population as follows.
          
               Each rule is evaluated on all the training examples and its fitness
          calculated in terms of the number of examples correctly classified by the rule.
          The rules are ranked in descending order of fitness and the weaker bottom
          half is removed. Pairs of surviving rules are randomly crossed over, as
          described above, and the offspring are inserted in the population. Finally, a
          few rules may be mutated at random by changing an attribute or its value in
          the rule. The process is repeated, producing generations of rules, until a
          predetermined number of iterations is reached or rules of satisfactory
          classification accuracy are obtained.
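               The generational scheme can be sketched independently of the rule
          representation, with the fitness, crossover, and mutation operators passed
          in as functions (the bit-string demonstration at the bottom is an
          illustrative stand-in for real BEAGLE rule conditions):

```python
import random

def next_generation(rules, fitness_fn, crossover_fn, mutate_fn,
                    mutation_rate=0.05, rng=random):
    """One BEAGLE-style generation: rank the rules by fitness, drop the
    weaker half, refill by crossing over random pairs of survivors, and
    occasionally mutate an offspring."""
    ranked = sorted(rules, key=fitness_fn, reverse=True)
    survivors = ranked[: len(ranked) // 2]
    offspring = []
    while len(survivors) + len(offspring) < len(rules):
        a, b = rng.sample(survivors, 2)
        child, _ = crossover_fn(a, b)
        if rng.random() < mutation_rate:
            child = mutate_fn(child)
        offspring.append(child)
    return survivors + offspring

# Toy demonstration: bit-string "rules" whose fitness is the count of 1s,
# crossed over at the midpoint, with mutation disabled.
def _fit(r): return r.count("1")
def _cross(a, b):
    p = len(a) // 2
    return a[:p] + b[p:], b[:p] + a[p:]
new_rules = next_generation(["1111", "1000", "1100", "0000"], _fit, _cross,
                            mutate_fn=lambda r: r, mutation_rate=0.0,
                            rng=random.Random(0))
```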
          
               Beagle has successfully been used to produce the knowledge base for
          a number of different applications, including a system to classify glass
          fragment evidence in forensic science (Evett and Spiehler, [7]). 
          
          
          6.6.4 Deriving a knowledge base for insurance underwriting using
          genetic algorithms
          
          Below, we give another example of how genetic algorithms were used to
          automatically derive a knowledge base for insurance underwriting
          (Nikolopoulos, Duvendack, [22]). The machine learning component of the
          described expert system extracts underwriter rules from a collection of policy
          holders' records, in order to determine when a policy should be cancelled.
          
               Policy holders' records are given in the form of attribute-value pairs.
          One of the goal attributes is the "terminate_policy" attribute which assumes
          one of the values:
          terminated_by_company, terminated_by_policy_holder and active_policy.
          Each data record had 19 attributes. The training set was noisy, mainly because
          of missing attribute values. The data set was separated into a training set and
          a testing set, containing 70% and 30%, respectively, of the available data
          patterns.
          
               Each individual of the genetic population consists of a sequence of
          rules and represents a candidate knowledge base for the problem domain. The
          19 attributes were labeled from 1 to 19 and each rule in an individual is of the
          form: a1,a2,...,a19,g where ai belongs to the domain of attribute i or is equal
          to *, and g is a value of the goal attribute. The symbol * is the
          don't care symbol, and was introduced in order to keep the individual's
          length constant, while at the same time permitting rules of varying
          effective lengths. The presence of * in position i of the rule chromosome indicates that
          attribute i is not present as a premise in the rule. For example, the encoding
          2*0**...*3 corresponds to the rule:
          
          IF (number_of_claims=2 and underwriter_code=no-violations) THEN
          policy=active,
          
          since number_of_claims was the first attribute in the attribute labeling, and
          underwriter_code was the third. The value 3 was used to encode the value
          "active" for the goal attribute policy.
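               The encoding can be illustrated with a short decoder and matcher. The
          partial attribute-name and goal-value tables below are assumptions standing
          in for the actual 19-attribute schema:

```python
# Illustrative, partial lookup tables (the real system had 19 attributes
# whose full schema is not given in the text).
ATTRIBUTE_NAMES = {1: "number_of_claims", 3: "underwriter_code"}
GOAL_VALUES = {"3": "active"}

def decode(chromosome):
    """Turn a 20-slot chromosome (19 attribute slots plus a goal value)
    into (premises, conclusion); a * slot means the attribute is absent
    from the rule's premise."""
    *slots, goal = chromosome
    premises = [(ATTRIBUTE_NAMES.get(i + 1, "attr_%d" % (i + 1)), value)
                for i, value in enumerate(slots) if value != "*"]
    return premises, GOAL_VALUES.get(goal, goal)

def matches(chromosome, record):
    """A 19-value record matches a rule if it agrees with the rule
    on every non-don't-care slot."""
    return all(slot == "*" or slot == value
               for slot, value in zip(chromosome[:-1], record))
```

          Decoding the chromosome 2*0**...*3 from the example above yields the two
          premises number_of_claims=2 and underwriter_code=0 with conclusion "active".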
          
               The given encoding enables all rules to be of constant length. The
          number of rules in an individual is also held constant and needs to be supplied
          to the algorithm. When a random initial population of individuals is created,
          as well as during mutation, the don't care symbol, *, is given a higher
          probability of selection, in order to ensure rules of varying lengths and
          candidate knowledge bases of varying sizes. The presence of the don't care
          symbol also makes the rules more general.
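               The biased initialization might be sketched as follows; the 0.7
          probability assigned to * is an assumed figure, as the actual bias used is not
          reported:

```python
import random

def random_rule(domains, goal_values, p_dont_care=0.7):
    """Draw one rule chromosome; domains[i] lists the legal codes for
    attribute i+1. The bias toward * yields rules of varying effective
    length (0.7 is an assumed value)."""
    slots = ["*" if random.random() < p_dont_care else random.choice(domain)
             for domain in domains]
    return slots + [random.choice(goal_values)]

def random_individual(n_rules, domains, goal_values):
    """An individual (candidate knowledge base) is a fixed-size
    sequence of rule chromosomes."""
    return [random_rule(domains, goal_values) for _ in range(n_rules)]
```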
          
               The fitness of an individual rule can be defined as the percentage of
          instances correctly classified by the rule out of the total number of instances
          which match the rule's premises. The fitness of a rule indicates the
          effectiveness of the rule as a discriminator, and the strength of an individual
          is the sum of the fitnesses of the rules in that individual. The fitness of an
          individual is the ratio of its strength to the sum of the strengths of all
          individuals in the population.
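               The fitness scheme can be sketched as follows, assuming a rule is
          represented as a (match-predicate, predicted-goal-value) pair; all names are
          illustrative:

```python
def rule_fitness(rule, records):
    """Fraction of matching records that the rule classifies correctly;
    0.0 if no record matches the rule's premises."""
    matches_fn, predicted = rule
    matched = [r for r in records if matches_fn(r)]
    if not matched:
        return 0.0
    return sum(1 for r in matched if r["goal"] == predicted) / len(matched)

def strength(individual, records):
    """Strength of an individual = sum of its rules' fitnesses."""
    return sum(rule_fitness(rule, records) for rule in individual)

def normalized_fitness(individual, population, records):
    """Fitness of an individual = its strength divided by the total
    strength of the population."""
    total = sum(strength(ind, records) for ind in population)
    return strength(individual, records) / total if total else 0.0
```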
          
               Convergence of the genetic algorithm after 20 generations produced
          a rule set which had a classification accuracy of 75 percent. This was quite
          good and comparable to the performance of human underwriters. But when a
          domain expert was given the rules for critique, it was observed that many of
          the rules did not make sense semantically, and some attributes which were not
          supposed to influence the decision to terminate a policy appeared
          frequently in rule premises (for example, the attributes type_of_coverage,
          family_or_individual etc.). These attributes were eliminated from
          consideration, and the genetic process was run again; the best rule set then
          produced had a classification accuracy of 61 percent, but was more
          agreeable and in line with the reasoning of the human expert underwriters.
          
               It is also possible to apply genetic algorithms in order to improve on
          a rule set derived by another inductive learning technique. As an example the
          reader is referred to [22] where a hybrid machine learning technique was used
          for generating the knowledge base. The hybrid approach outperformed both
          of the other approaches used individually, and is further discussed in Chapter
          8.