MACHINE LEARNING
          
          
          6.2 DECISION TREES AND THE ID3 ALGORITHM
          
           The Iterative Dichotomizer 3 (ID3) algorithm is an inductive learning
           algorithm developed by Quinlan ([24]), which induces generalized
           knowledge from a set of training instances. The derived knowledge is
           expressed in the form of classification (decision) trees. Variations of ID3
           have been used successfully in many applications, including learning the
           knowledge bases of systems that learn chess end games and predict soybean
           disease classifications. Many of the first commercial automated knowledge
           acquisition systems, such as VP-Expert, Xi Plus, 1st-Class, ExpertEase and
           Knowledge Quest, were based on variations of the ID3 algorithm.
          
          
          6.2.1 The ID3 algorithm
          
           Given a goal attribute (an attribute for which we want to learn how the other
           attributes affect its value), the ID3 algorithm constructs a classification tree
           in a recursive, top-down, divide-and-conquer fashion, as follows:
          
               Begin with the whole set of instance vectors at the root node and
          progressively separate them into subsets, one subset for each value in the
          domain of the chosen "branching" attribute, in such a way that all instances in
          the same subset share the same value of the "branching" attribute. Each
          generated subset forms a child node. If the instances contained in a node do
          not share the same value for the goal attribute, the node is expanded further
          by using another branching attribute, otherwise it becomes a leaf. The process
          terminates when there are no nodes that can be expanded further. 
          
               To keep the branching factor of the classification tree manageable, a
          continuous or multivalued discrete domain is usually divided into a small
          number of subdomains. For example, if the age of a company in the data set
          ranges between one and ninety years, we may want to convert it into three
          subranges, old, average and new, where "new" is any age less than ten years,
          "old" any age greater than 60 years and "average" ranges in between.
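The subrange conversion described above can be sketched as a small helper; the thresholds (10 and 60 years) are the ones from the text, while the function name is of course arbitrary:

```python
def discretize_age(age):
    """Map a company's age in years onto the three subranges from the
    text: "new" below 10 years, "old" above 60, "average" in between."""
    if age < 10:
        return "new"
    elif age > 60:
        return "old"
    else:
        return "average"

print([discretize_age(a) for a in (3, 25, 75)])  # ['new', 'average', 'old']
```

Any fixed convention for the boundary ages (exactly 10 or 60 years) would do, as long as it is applied consistently over the data set.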
          
               We give below an example of how a classification tree would be
          constructed for the following sample loan approval database.
          
          
          
           LOAN     INCOME     HOME-OWNER     DEBT
           ---------------------------------------
           no       low        no             high
           no       low        no             low
           no       low        yes            high
           no       average    yes            high
           no       average    no             low
           no       average    no             high
           yes      average    yes            low
           no       high       no             low
           no       high       no             high
           yes      high       yes            high
          
          
          
                Suppose we want to determine whether an individual's loan application
           should be approved, based on the determining attributes income, home
           ownership and debt. By constructing a knowledge base of rules whose
           hypothesis is the goal attribute "loan" and whose premises are conditions
           on the other attributes, a "profile" of the successful applicant is constructed.
          
                Of course, statistical techniques could be used, or a neural net could
           be trained on the data, to predict the eligibility of a given applicant. Such
           an approach, however, lacks explanation facilities: it provides no deeper
           understanding of the underlying relations between the attributes and how they
           affect each other, nor does it allow a recommendation to approve or reject
           the loan to be justified to the applicant.
          
               Let us now apply the ID3 classification tree process on the database
          given above. Suppose that we first choose the attribute debt as the branching
          attribute. Based on this attribute, we split the set of data tuples into two
          groups, one with "debt" equal to high and one with "debt" equal to low.
          
               Both child nodes need now to be split, since there is no consensus on
          loan approval being yes or no in either leaf. If income is chosen next as the
          branching attribute, we get the second level of the tree given below. Four out
          of the six leaf nodes in the tree given below, need not be expanded further, as
          the vectors in the subsets corresponding to each node share the same value for
          the goal attribute.
          
          
          
           Debt
           |-- high
           |     |-- Income = low:     (no,low,no,high), (no,low,yes,high)
           |     |-- Income = average: (no,average,yes,high), (no,average,no,high)
           |     `-- Income = high:    (no,high,no,high), (yes,high,yes,high)
           `-- low
                 |-- Income = low:     (no,low,no,low)
                 |-- Income = average: (no,average,no,low), (yes,average,yes,low)
                 `-- Income = high:    (no,high,no,low)
          
          
          If "home-owner" is chosen to expand the remaining two leaf nodes, the
          following tree will be derived:
          
          
          
           Debt
           |-- high
           |     |-- Income = low:     (no,low,no,high), (no,low,yes,high)
           |     |-- Income = average: (no,average,yes,high), (no,average,no,high)
           |     `-- Income = high
           |           |-- Home-owner = yes: (yes,high,yes,high)
           |           `-- Home-owner = no:  (no,high,no,high)
           `-- low
                 |-- Income = low:     (no,low,no,low)
                 |-- Income = average
                 |     |-- Home-owner = yes: (yes,average,yes,low)
                 |     `-- Home-owner = no:  (no,average,no,low)
                 `-- Income = high:    (no,high,no,low)
          
               Each path from the root to a leaf corresponds to a rule. The rules
          derived from the tree given above are as follows:
          
               loan(no):-debt(high),income(low).
               loan(no):-debt(high),income(average).
               loan(yes):-debt(high),income(high),home_owner(yes).
               loan(no):-debt(high),income(high),home_owner(no).
               loan(yes):-debt(low),income(average),home_owner(yes).
               loan(no):-debt(low),income(average),home_owner(no).
               loan(no):-debt(low),income(low).
               loan(no):-debt(low),income(high).
          
               It should be clear at this point that the choice of the branching
          attributes affects the resulting decision tree. For example, if we choose to first
          branch on "debt" followed by "home-owner" the following decision tree is
          derived:
          
          
           Debt
           |-- high
           |     |-- Home-owner = yes
           |     |     |-- Income = low:     (no,low,yes,high)
           |     |     |-- Income = average: (no,average,yes,high)
           |     |     `-- Income = high:    (yes,high,yes,high)
           |     `-- Home-owner = no: (no,high,no,high), (no,average,no,high),
           |                          (no,low,no,high)
           `-- low
                 |-- Home-owner = yes: (yes,average,yes,low)
                 `-- Home-owner = no:  (no,average,no,low), (no,high,no,low),
                                       (no,low,no,low)
               
          This tree corresponds to the rule set:
          
               loan(no):-debt(high),home_owner(yes),income(average).
               loan(no):-debt(high),home_owner(yes),income(low).
               loan(yes):-debt(high),home_owner(yes),income(high).
               loan(no):-debt(high),home_owner(no).
               loan(yes):-debt(low),home_owner(yes).
               loan(no):-debt(low),home_owner(no).
               
                Both rule sets classify the vectors in the data set with 100 percent
           accuracy, but in the second set the rules are shorter, and therefore more
           general, and there are fewer of them. The question that now arises is how
           to determine the order in which to consider the branching attributes, so
           that the derived tree has the smallest possible depth and branching factor.
           The brute-force approach of finding the optimum tree, by producing the trees
           for all possible permutations of branching attributes and selecting the "best,"
           causes combinatorial explosion; the problem of optimizing a decision tree is
           NP-complete. The ID3 algorithm proposes a heuristic solution to this
           problem: the information-theoretic measure of entropy is used to select
           branching attributes that are likely to lead to a small tree.
          
               The concept of entropy is introduced in the following section. 
          
          
          6.2.2 Entropy
          
          Given a completely expanded classification tree and a set of attribute values
          corresponding to a path from the root to a leaf, the "uncertainty" as to the
          value of the goal attribute has been reduced to zero. When no attribute values
          are known for an instance we have maximum uncertainty about its
          classification, while the more attribute values are known the more the
          uncertainty is reduced. In order to derive a classification tree of the smallest
          depth, a greedy heuristic can be employed where the attributes which cause
          the greatest reduction in uncertainty of classification are used first to expand
          the tree. ID3 uses the concept of entropy as a quantitative measure of the
          "uncertainty" of classification induced by an attribute. 
          
               The uncertainty, as to which of several events is going to occur (for
          example, whether the loan application will be approved or not), is a function
          of the probabilities of those events. In what follows, let us assume that
          H(P1,P2,...,Pm) is such an uncertainty function, where Pi's are the probabilities
          of m events.
          
          An uncertainty measure must satisfy the following requirements:
          
          1.   If the number of equiprobable choices for the outcome of an
               experiment is increased, then the uncertainty about the 
               outcome is increased. For example, 
                H(1/2,1/2) is less than H(1/4,1/4,1/4,1/4)
          2.   The uncertainty as to the joint occurrence of two independent 
               events equals the sum of the uncertainties as to their 
               individual occurrences.
          3.   The uncertainty measure must be a continuous function of the 
               probabilities involved. In other words, a small change in 
               the probabilities must be associated with a small change in     
               uncertainty.
          
                It has been proven that, up to the choice of a constant factor and
                logarithm base, there is only one function which satisfies the given
                conditions, namely the entropy.
          
                Given m possible outcomes of an experiment, having respective
           probabilities P1,P2,...,Pm, the uncertainty function is given by:

                H(P1,P2,...,Pm) = -a * SUM_i Pi*log(Pi)

           where a is an arbitrary positive number, and the logarithm base
           is greater than one. This function is known in information
           theory as the entropy of an information source. In the
           following examples, we use the value of 1 for a
           and a logarithmic base equal to 10.
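A direct transcription of this definition might look as follows (a minimal sketch; the usual convention 0*log(0) = 0 is assumed for zero probabilities, and the base argument selects the logarithm base):

```python
import math

def entropy(probs, a=1.0, base=10):
    """H(P1,...,Pm) = -a * SUM_i Pi*log(Pi); zero probabilities
    contribute nothing, by the convention 0*log(0) = 0."""
    return -a * sum(p * math.log(p, base) for p in probs if p > 0)

# Requirement 1: more equiprobable choices means more uncertainty.
print(entropy([0.5, 0.5], base=2))                # 1.0
print(entropy([0.25, 0.25, 0.25, 0.25], base=2))  # 2.0
```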
          
                The entropy of an event measures our uncertainty about the
           possible outcome of the event. This measure can be extended to tell
           us how much information is gained about an event Y if we are told
           that a second event X has occurred. If Xi is the ith possible outcome
           of X, Yj the jth possible outcome of Y, P(Xi,Yj) the joint
           probability, and P(Yj/Xi) the corresponding conditional probability,
           then the conditional uncertainty of Y given X is defined as follows:

                H(Y/X) = -SUM_{i,j} P(Xi,Yj)*log[P(Yj/Xi)]
          
                It is always the case that H(Y/X) is less than or equal
           to H(Y). If X and Y are independent, then H(Y)=H(Y/X).
          In general, the difference between the uncertainties,
          H(Y)-H(Y/X), is a measure of the amount of information contained in X
          about Y or viewed another way, how much influence the outcome of 
          event X has on event Y.
          
          
          6.2.3 Using entropy to optimize the classification tree
          
               Applying the concept of entropy to the decision tree optimization
          problem, we find that if an object can be classified into one of 
          several different groups, the entropy is a measure of the 
          uncertainty of the classification of that object. 
           Mathematically, if an object can be classified into N classes,
           C1,...,CN, and the probability of the object being in class Ci
           is P(Ci), then the entropy of classification C is

                H(C) = -SUM_i P(Ci)*log(P(Ci))
          
          What is needed, in order to determine the attribute on which to 
          further subdivide a node, is the entropy (uncertainty) of 
          classification after an attribute is used. The decision tree 
          will then be extended using the attribute
          that results in the smallest entropy of classification.
          
             The first step in calculating the entropy of classification,
          H(C/A), for a particular attribute A, is to split the data table 
          into subtables in such a way that tuples in a subtable have the 
          same value of the partitioning attribute.
          
           The entropy of each subtable, H(C/aj), is calculated for each value
           aj of the attribute A. For each possible value aj of
           attribute A, H(C/aj) is given by the expression:

                H(C/aj) = -SUM_i P(Ci/aj)*log[P(Ci/aj) + n]

                The quantity P(Ci/aj) is the probability that the class value is
           Ci when the splitting attribute value is aj. The number n is a small
           adjustment factor inserted so that the logarithm is defined even when
           P(Ci/aj)=0 and log[P(Ci/aj)] is not defined.
           The overall entropy of classification, H(C/A), is then given by the
           weighted average of the entropies for the m possible values aj of the
           attribute A. Mathematically this is expressed as

                H(C/A) = SUM_j P(aj)*H(C/aj)
          
          We now illustrate the use of entropy for constructing decision trees.
          Consider the loan approval database. If the attribute "debt" is used
          to subdivide the patterns into two sets, one for each domain value of
          "debt", we get:
          
          
           LOAN     INCOME     HOME-OWNER     DEBT
           ---------------------------------------
           no       low        no             high
           no       low        yes            high
           no       average    yes            high
           no       average    no             high
           no       high       no             high
           yes      high       yes            high
           ---------------------------------------
           yes      average    yes            low
           no       high       no             low
           no       low        no             low
           no       average    no             low
          
          
          
          
          
          
          We can now calculate the entropy of each subtable:
          
          H(loan/debt=high)=
          -[P(loan=yes/debt=high)log(P(loan=yes/debt=high))
          +P(loan=no/debt=high)log(P(loan=no/debt=high))]=
          -[(1/6)log(1/6)+(5/6)log(5/6)] = 0.196
          
          H(loan/debt=low)=
          -[P(loan=yes/debt=low)log(P(loan=yes/debt=low))
          +P(loan=no/debt=low)log(P(loan=no/debt=low))]=
          -[(1/4)log(1/4)+(3/4)log(3/4)]=0.244
          
                In order to find the entropy of the entire table after the split,
           H(loan/debt), we take the entropy for each value of the attribute, multiply
           it by the probability of that value appearing in the table, and sum the
           products.
          
          H(loan/debt)=
          P(debt=high)*H(loan/debt=high)+P(debt=low)*H(loan/debt=low)=
          (6/10)*0.196 +(4/10)*0.244= 0.215 .
          
               If we perform these same calculations for the other attributes in our
          example, we find that:
          
          H(loan/income)=0.180, and
          H(loan/home_ownership)=0.120.
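These figures can be checked mechanically. The sketch below (base-10 logarithms, matching the worked figures above; all names are ours) computes the conditional entropy H(loan/A) for each candidate attribute:

```python
import math
from collections import Counter, defaultdict

# (loan, income, home_owner, debt) tuples from the sample database.
DATA = [
    ("no", "low", "no", "high"), ("no", "low", "no", "low"),
    ("no", "low", "yes", "high"), ("no", "average", "yes", "high"),
    ("no", "average", "no", "low"), ("no", "average", "no", "high"),
    ("yes", "average", "yes", "low"), ("no", "high", "no", "low"),
    ("no", "high", "no", "high"), ("yes", "high", "yes", "high"),
]
COLS = {"loan": 0, "income": 1, "home_owner": 2, "debt": 3}

def H(rows, goal="loan"):
    """Entropy of the goal attribute over a (sub)table, base-10 logs."""
    counts = Counter(r[COLS[goal]] for r in rows)
    return -sum(c/len(rows) * math.log10(c/len(rows))
                for c in counts.values())

def H_given(attr, rows=DATA):
    """H(C/A): the weighted average of the subtable entropies."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[COLS[attr]]].append(r)
    return sum(len(g)/len(rows) * H(g) for g in groups.values())

for a in ("debt", "income", "home_owner"):
    print(a, round(H_given(a), 3))
```

"home_owner" yields the smallest conditional entropy, confirming it as the best initial split; the last printed digit may differ from the figures above, which truncate rather than round.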
          
           Since H(loan/home_ownership) yields the smallest entropy, and thus the least
           uncertainty, "home_ownership" is the best attribute to select for the initial
           split. Indeed, by using this attribute we immediately obtain the very short rule
          
          IF (home_ownership=no) THEN (loan=no). 
          
          At each node, this decision process is repeated, in order to determine which
          attribute is to be chosen for further expansion. The reader can verify that
          "income" should be selected next, followed by "debt", and that the derived
          classification tree corresponds to the rules:
          
           IF (home_ownership=no) THEN (loan=no).
           IF (home_ownership=yes and income=low) THEN (loan=no).
           IF (home_ownership=yes and income=high) THEN (loan=yes).
           IF (home_ownership=yes and income=average and debt=low) THEN (loan=yes).
           IF (home_ownership=yes and income=average and debt=high) THEN (loan=no).
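The whole procedure, from entropy-based attribute selection down to rule extraction, fits in a short recursive sketch (a minimal illustration, not Quinlan's original code; all names are ours, and ties in the entropy comparison are broken arbitrarily):

```python
import math
from collections import Counter, defaultdict

# (loan, income, home_owner, debt) tuples from the loan approval database.
DATA = [
    ("no", "low", "no", "high"), ("no", "low", "no", "low"),
    ("no", "low", "yes", "high"), ("no", "average", "yes", "high"),
    ("no", "average", "no", "low"), ("no", "average", "no", "high"),
    ("yes", "average", "yes", "low"), ("no", "high", "no", "low"),
    ("no", "high", "no", "high"), ("yes", "high", "yes", "high"),
]
ATTRS = {"income": 1, "home_owner": 2, "debt": 3}
GOAL = 0  # column index of the goal attribute "loan"

def entropy(rows, base=10):
    counts = Counter(r[GOAL] for r in rows)
    return -sum(c/len(rows) * math.log(c/len(rows), base)
                for c in counts.values())

def cond_entropy(rows, attr):
    """H(C/A): weighted average of the subtable entropies."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[ATTRS[attr]]].append(r)
    return sum(len(g)/len(rows) * entropy(g) for g in groups.values())

def id3(rows, attrs, conds=()):
    """Expand a node; return the rules for the subtree below it."""
    if len({r[GOAL] for r in rows}) == 1 or not attrs:
        label = Counter(r[GOAL] for r in rows).most_common(1)[0][0]
        cond = " and ".join(f"{a}={v}" for a, v in conds)
        return [f"IF ({cond}) THEN loan={label}"]
    best = min(attrs, key=lambda a: cond_entropy(rows, a))
    groups = defaultdict(list)
    for r in rows:
        groups[r[ATTRS[best]]].append(r)
    rules = []
    for val, g in sorted(groups.items()):
        rules += id3(g, [a for a in attrs if a != best],
                     conds + ((best, val),))
    return rules

for rule in id3(DATA, list(ATTRS)):
    print(rule)
```

On this data set the sketch picks "home_owner" first, then "income", then "debt", and prints the five rules given above.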
          
          
          6.3 LEARNING FROM NOISY DATA
          
           In many applications, the learning data set is imperfect. Common problems
           include errors in attribute or class values, and missing values. In such
           cases, we say that the data is noisy. Inducing decision trees from noisy data,
          using the ID3 tree induction algorithm, gives rise to two problems: first,
          induced trees may unreliably classify new objects, and second, induced trees
          tend to be large and thus hard to understand. Noise makes the learning task
          more difficult and in order to effectively learn in a noisy domain, the
          requirement that learned concept descriptions match all the positive examples
          and no negative ones, is usually relaxed. The learned description is permitted
          to misclassify some of the learning objects. As an example, consider a
          situation in which we are to extend a node of a decision tree and let S be the
          current subset of objects in the node. Suppose that there are 100 objects in S,
          99 of which belong to class C1 and one of which belongs to class C2. Knowing
          that there is noise in the learning data set, and that all of the objects in S agree
          on the values of the attributes already selected up to this point, it seems
          plausible that the object in class C2, is in S due to an error in the data. If so, it
          is best to ignore this object, terminate expansion of that node in the decision
          tree and make it a leaf labeled with class C1. Since the original ID3 algorithm
          would in this situation further expand the decision tree, by stopping at this
          point we have pruned a subtree of the complete decision tree. Tree pruning is
          the key to coping with noise in tree induction algorithms. An algorithm may
          effectively prune decision trees by using some criterion that indicates whether
          to stop expanding the tree or not. The stopping criterion would typically take
          into account the number of examples in the node, the prevalence of the
          majority class at the node, etc. This kind of pruning, accomplished through
          stopping the tree expansion if certain criteria are met, is called
          construction-time pruning (pre-pruning). In an alternate method of pruning,
          called post-pruning, the whole decision tree is constructed first and then
          pruned, starting at the leaves and moving upwards.
          
          
          6.3.1 Construction time pruning
          
          Assume that the generated decision tree has become very "deep" (i.e., there
          are some long paths from the root to the leaves) and that most nodes, far away
          from the root, contain instances with the vast majority sharing the same value
          of the goal attribute, except maybe a few. This phenomenon indicates that
          those nodes should contain instances of a single class, and the small
          percentage of instances in the other class is probably due to noise. These
          nodes are created in the decision tree because the induction algorithm tries to
          overfit the training set. A good strategy is to stop branching when this
          situation occurs.
          
               As an example of a criterion, which can be used to determine when a
          node should be pruned, consider the depth pruning ratio. The depth pruning
          ratio is defined as follows:
          
           R(a,d) = a^d

           where the depth pruning threshold, a, is a constant ranging between 0 and 1,
           and d is the depth of the node under consideration for expansion (i.e., the
           number of branches from the tree root to the current node). During the
           induction of a decision tree, the percentages of positive instances and
           negative instances are compared with the depth pruning ratio at each node.
           When either of them is greater than or equal to the depth pruning ratio,
           the node is labeled as a leaf without further branching.
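Read as R(a,d) = a^d, the criterion is a one-liner. The sketch below assumes a two-class problem where majority_fraction is the larger of the two class percentages; the threshold values in the examples are illustrative:

```python
def should_prune(majority_fraction, depth, a=0.9):
    """Label the node a leaf when the majority-class fraction meets
    the depth pruning ratio R(a, d) = a**d, where a is the depth
    pruning threshold (0 <= a <= 1) and d is the node's depth."""
    return majority_fraction >= a ** depth

print(should_prune(0.95, 1, a=0.97))  # False: keep splitting near the root
print(should_prune(0.95, 5, a=0.97))  # True: prune deep in the tree
```

Note the two extremes from the text: a=1 gives R=1 at every depth (only pure nodes become leaves), while a=0 prunes everything immediately.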
          
                Based on the definition of the pruning ratio, nodes containing nearly
           100 percent instances of one class may still need to be partitioned further
           when they are close to the root. This prevents excessive pruning and thus
           preserves accuracy, which is warranted by the small depth of these nodes.
           On the other hand, if such nodes are far away from the root, branches below
           them are probably unnecessary. The depth pruning ratio decreases as the
           depth of the node increases; eventually, the percentage of positive (or
           negative) instances exceeds it, and these nodes are labeled as leaves.
          
                It is crucial to have a good depth pruning threshold. A bad threshold
           leads either to over-pessimistic pruning (e.g., a=1) or to excessive pruning
           (e.g., a=0), whereas a good one both prunes the noisy branches effectively
           and avoids excessive pruning. Experimentation with different threshold
           choices is needed, until a satisfactory decision tree is derived.

           For examples of other existing pruning techniques, such as retrospective
           pruning and the chi-squared method, the reader is referred to the literature.
          
          
          6.3.2 Post Pruning
          
          We present here a post pruning algorithm, which was introduced in the
          Assistant 86 learning system, ([4]). The key decision in post-pruning is
          whether to prune a subtree or not. This decision is based on classification
          error estimates of the decision subtree. 
          
               Let S be the set of objects classified into the root of the subtree
          considered for pruning, and K be the number of classes (size of the domain of
          the goal attribute). Suppose that there are N instances in S, the majority class
          in S is C, n out of N instances in S belong to C, and the rest of the instances in
          S belong to other classes. If we prune the subtrees of this node and label the
          node with its majority class C, the result will be the introduction of a
          classification error. Under the assumption that all classes are equiprobable
          and that the prior distribution of the probabilities of the classes is uniform, the
          expected probability of the classification error at that node E(S) for set S is:
          
          E(S)= (N-n+K-1)/(N+K) 
          
               Such an error estimator can be directly applied to estimate the
          classification error at each leaf of a decision tree: Error(leaf)=E(S), where S
          is the set of instances at the leaf. This error estimator is easily generalized to
          estimate classification errors of non-leaf nodes in a decision tree as follows.
          
                Let Node be a non-leaf node of a decision tree with children Node1,
           Node2, etc., and let Pi be the probability of the branch from Node to Nodei.
           The probabilities Pi can be approximated by the relative frequencies of the
           sets of instances contained in the nodes Nodei. Then we define:

           BackedUpError(Node) = SUM_i Pi*Error(Nodei)
          
          We can now compare the node's "static" error with its backed-up error. In the
          case that the static error is lower, then we prune the node's subtree. 
          
               This formula suggests the following pruning algorithm to optimize the
          estimated classification error. Given a complete (unpruned) decision tree, the
          algorithm starts at the bottom of the tree and progresses toward the root of the
          tree, backing up the error estimates and pruning inferior subtrees. This
          algorithm guarantees that all of the subtrees are pruned optimally with respect
          to the classification error estimates. As a result, it will minimize the expected
          error at the root of the tree; that is, the expected error of the whole tree. Of
          course, the pruned tree obtained this way is only optimal with respect to error
          estimates.
          
                Figure 6.1 shows the pruning of a decision tree using the procedure
           described above. In the figure, the left-most leaf of the unpruned tree has
           class frequency [3,2], meaning that three objects of class C1 and two
           objects of class C2 are contained in this leaf. The static error for these
           class frequencies is:

           E = (N-n+K-1)/(N+K) = (5-3+2-1)/(5+2) = 0.429

           Similarly, for the left-hand child of node b the error estimate is 0.4,
           and for the right child the error is 0.2. For node b, the static error
           estimate is:

           E(b) = (11-9+2-1)/(11+2) = 0.230

           The backed-up error estimate for b is:

           BackedUpError(b) = (3/11)*(0.4) + (8/11)*(0.2) = 0.254

           Since the static error of node b is smaller than the backed-up error, the
           subtrees of node b are pruned, making b a leaf. The process is then repeated
           for node a.
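The arithmetic of this example can be reproduced directly; the sketch below uses the class frequencies quoted for Figure 6.1 (function names are ours):

```python
def static_error(N, n, K):
    """E(S) = (N - n + K - 1) / (N + K): expected classification error
    when a node with N instances, n of them in the majority class,
    is made a leaf (K = number of classes)."""
    return (N - n + K - 1) / (N + K)

def backed_up_error(children):
    """Weighted sum of child errors; children is a list of
    (branch probability, error estimate) pairs."""
    return sum(p * e for p, e in children)

print(static_error(5, 3, 2))  # left-most leaf, [3,2]: 3/7

# Node b: 11 instances, 9 in the majority class, two classes.
e_static = static_error(11, 9, 2)                       # 3/13
e_backup = backed_up_error([(3/11, 0.4), (8/11, 0.2)])  # 2.8/11
print("prune b:", e_static < e_backup)  # True: make b a leaf
```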
          
          
          6.4 VERSION SPACE SEARCH AND CONCEPTUAL CLUSTERING
          
          We present an algorithm, ([21]), which is a combination of the candidate
          elimination algorithm, ([17]), and the conceptual clustering technique ([16]). 
          
               Assume that each training example is expressed as a vector of
          attribute-value pairs. The attribute values can be nominal or categorical in
          nature. 
          
               An object is defined as an instance vector of attribute-value pairs. For
          example, O=(age=new, competition=no, product=tv, profit=up) is an object.
          
               A concept is a generalization of an object and is described by the
          same description language as objects, but attributes can assume a set of values
          instead of just a single value. For example, C=(age={new,old},
          competition=no, product={tv,vcr}, profit=up) is a concept, which contains the
          object O as an instance. Each concept corresponds to a rule. For example,
          given that the goal attribute is the profit, concept C can be stated as: If a
          company is old or new, and there is  no competition, and the product is TVs
          or VCRs, then the profit is up. 
          
               A concept A is more general than a concept B or A covers B (A>B)
          iff each object which belongs to the concept B also belongs to the concept A.
          The relation "more general than" defines a partial order on the set of all
          possible concepts defined on an attribute set. The search techniques of AI can
          be used to search the lattice of concepts (the version space) in order to find a
          most general, complete and consistent concept covering the set of training
          examples. 
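               The coverage test and the "more general than" ordering can be
          sketched directly, representing a concept as a mapping from each attribute
          to its set of allowed values (a sketch under that assumption; the goal
          attribute is omitted for brevity):

```python
def covers(concept, obj):
    """A concept covers an object iff, for every attribute, the object's
    value is among the values the concept allows for that attribute."""
    return all(obj[a] in vals for a, vals in concept.items())

def more_general(a, b):
    """A is more general than B (A > B) iff every value B allows for an
    attribute, A also allows -- so every object of B is an object of A."""
    return all(b[attr] <= a[attr] for attr in a)  # set inclusion per attribute

# Concept C and object O from the text (goal attribute omitted).
C = {"age": {"new", "old"}, "competition": {"no"}, "product": {"tv", "vcr"}}
O = {"age": "new", "competition": "no", "product": "tv"}
print(covers(C, O))  # True
```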
          
               A concept A is complete with respect to a set of training examples iff
          it covers all the positive examples in the set.
          
               A concept A is consistent with respect to a set of training examples
          iff it contains none of the negative examples.
          
               A concept G is a maximally specific generalization of a concept A iff
          G is more general than A and there is no concept G' such that G>G'>A.
          Consider the concept A=(age=midlife, competition=no, product_type=VCR).
          Maximally specific generalizations can be obtained by assigning to one of the
          attributes one more value than the ones it already has. For example, the
          concept G=(age={midlife,new}, competition=no, product_type=VCR) is one
          of the maximally specific generalizations of A, since no concept strictly
          less general than G is still strictly more general than A.
          
               The algorithm given below is a variation of the candidate elimination
          algorithm ([17], [19], [20]). It is a bottom up algorithm, in contrast to ID3.
          ID3 starts with the whole set of examples, or equivalently with the most
          general concept (the one in which every attribute may assume any value in
          its domain), and proceeds by specializing. The algorithm given below starts
          with positive instances (the most specific concepts) and gradually
          generalizes them, so that they remain consistent. Heuristics are used to guide
          the search through the potentially prohibitive size of the version space. 
          
          Algorithm:
          
          Initialize the concept set (output) to empty.
          REPEAT
               Get a positive example.
               IF the example is covered by any of the concepts in the current
                    concept set THEN go to LOOP
               ELSE
               begin
                    The example becomes the maximally specific concept at the
                         root level of a tree.
                    Current_level = 1
                    REPEAT
                         For each concept node in Current_level, find all its
                              maximally specific generalization concepts, which
                              become its children in the tree, at level
                              Current_level = Current_level + 1.
                         Delete from Current_level the concepts that are not
                              consistent.
                    UNTIL all nodes are leaves
                    Insert all concepts corresponding to tree leaves in the
                         concept set.
               end
          LOOP UNTIL no more positive examples.
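               The inner loop of the algorithm (growing one positive example into
          its most general consistent concepts) can be sketched as follows. This is
          an exhaustive sketch without the beam heuristic discussed next, assuming
          attribute domains are given as sets; the company data used later in this
          section serves as the test case:

```python
def covers(concept, obj):
    """concept: attribute -> set of allowed values; obj: attribute -> value."""
    return all(obj[a] in concept[a] for a in concept)

def generalizations(concept, domains):
    """Maximally specific generalizations: add one new value to one attribute."""
    for a in concept:
        for v in domains[a] - concept[a]:
            g = dict(concept)
            g[a] = concept[a] | {v}
            yield g

def _key(c):
    # Hashable signature of a concept, for per-level deduplication.
    return tuple(sorted((a, tuple(sorted(vs))) for a, vs in c.items()))

def generalize_example(pos, negatives, domains):
    """Grow one positive example bottom-up into the leaves of its concept
    tree: concepts that cover no negative example and cannot be
    generalized further without doing so."""
    level = [{a: {v} for a, v in pos.items()}]
    leaves, seen = [], set()
    while level:
        nxt = {}
        for node in level:
            kids = [g for g in generalizations(node, domains)
                    if not any(covers(g, n) for n in negatives)]
            if kids:
                for g in kids:
                    nxt[_key(g)] = g          # deduplicate within the level
            elif _key(node) not in seen:
                seen.add(_key(node))
                leaves.append(node)           # cannot generalize further
        level = list(nxt.values())
    return leaves

# The company data from the table in this section; the negatives are the
# profit=down rows, the positive is the first profit=up row.
domains = {"age": {"old", "midlife", "new"},
           "competition": {"yes", "no"},
           "product": {"software", "hardware"}}
negatives = [{"age": "old", "competition": "no", "product": "software"},
             {"age": "midlife", "competition": "yes", "product": "software"},
             {"age": "old", "competition": "no", "product": "hardware"},
             {"age": "midlife", "competition": "yes", "product": "hardware"},
             {"age": "old", "competition": "yes", "product": "software"}]
pos = {"age": "midlife", "competition": "no", "product": "hardware"}
for leaf in generalize_example(pos, negatives, domains):
    print(leaf)
```

          Run on the company data, this produces the single concept
          ({midlife,new}, no, {hardware,software}) derived by hand in the text.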
          
          
               If m is the number of attributes, then the size of a concept tree can be
          as big as 1 + m + m(m-1) + m(m-1)(m-2) + ... + m(m-1)...2·1 ≈ e·m!. To
          avoid this combinatorial explosion, heuristics can be used to guide the
          expansion of the trees. A fitness measure can be used to evaluate the potential
          usefulness of a node, and only the best K nodes of a level are expanded each
          time. Let A be the attribute which gives the smallest uncertainty (entropy) of
          classification about the value of the goal attribute. The tree nodes expanded
          by the algorithm are then the ones derived from their parent by extending
          attribute A. 
          
               Consider for example, the set of object vectors (from [26]), where the
          goal attribute is profit=up:
          
          
          PROFIT     AGE        COMPETITION     PRODUCT_TYPE
          down       old        no              software
          down       midlife    yes             software
          up         midlife    no              hardware
          down       old        no              hardware
          up         new        no              hardware
          up         new        no              software
          up         midlife    no              software
          up         new        yes             software
          down       midlife    yes             hardware
          down       old        yes             software
          
          
          The algorithm would expand the first positive instance as follows:
          
                              (midlife, no, hardware)
                                   |
                                   |--- (midlife, no, {hardware,software})
                                   |--- (midlife, {yes,no}, hardware)
                                   |--- ({midlife,old}, no, hardware)
                                   |--- ({midlife,new}, no, hardware)
          
               The second concept at level 2 of the tree covers the negative instance
          (down,midlife,yes,hardware) and the third covers the negative instance
          (down,old,no,hardware), so they are eliminated from further consideration.
          The first node is expanded by its maximally specific generalizations as
          follows (the fourth node, which remains consistent, generalizes to the same
          final concept):
          
          
          
                              (midlife, no, hardware)
                                   |
                         (midlife, no, {hardware,software})
                                   |
                                   |--- (midlife, {yes,no}, {hardware,software})
                                   |--- ({midlife,new}, no, {hardware,software})
                                   |--- ({midlife,old}, no, {hardware,software})
          
               The first and third of the leaves are inconsistent (they cover negative
          instances), so the concept set after the first positive example is the concept
          ({midlife,new},no, {hardware,software}). This concept corresponds to the
          rule: IF age=(midlife or new) and competition=no THEN profit=up.
          
               Repeating the process for each positive example, will also yield the
          rule: IF age=new THEN profit=up.
          
               As another example, consider the following set of object vectors,
          where the type of airplane is the goal attribute.
          
          
          PLANE_TYPE   ENGINE_TYPE   WING_POSITION   WING_SHAPE   TAIL      BULGES
          C130         prop          high            regular      regular   under_wings
          C141         jet           high            swept        T-tail    aft_wings
          C5A          jet           high            swept        T-tail    none
          C747         jet           low             swept        regular   aft_cockpit
          
          
          
          With a branching factor of K=1, and the entropy used as a heuristic measure,
          the algorithm will derive the same rule set as ID3, namely:
          
          IF bulges=under_wings THEN airplane=C130
          IF bulges=aft_wings THEN airplane=C141
          IF bulges=none THEN airplane=C5A
          IF bulges=aft_cockpit THEN airplane=C747
          
          For K=2, three more rules will be derived in addition to the rules above:
          
          IF wing_shape=regular THEN airplane=C130
          IF engine=prop THEN airplane=C130
          IF wing_position=low THEN airplane=C747
          
               Finally we give an example where ID3 fails, but the conceptual
          clustering algorithm succeeds. Consider the following database containing
          contradictions:
          
          
          
          DISEASE   EYES        TEMPERATURE   NOSE
          none      clear       normal        clear
          cold      bloodshot   high          runny
          flu       bloodshot   high          runny
          
          
          
          ID3 fails, while the algorithm described above gives the following set of rules:
          
          IF nose=clear THEN disease=none (for branching factor K=1)
          
          or
          
          IF nose=clear THEN disease=none and
          IF eyes=clear THEN disease=none (for branching factor K=2).
          
          
          6.5 CASE BASED REASONING
          
          Case-based reasoning (CBR) is a form of analogical reasoning which has
          enjoyed increased popularity as a developmental methodology due to its
          avoidance of the knowledge acquisition bottleneck and the natural way in
          which it integrates with object oriented environments. In case-based
          reasoning, the system recalls previous situations similar to the current one
          and adapts them to help solve the current problem. The existing descriptions
          of previously encountered problems are known as cases. Cases are usually
          implemented as objects in an object oriented environment even though other
          knowledge representation schemes (for example attribute-value pairs, rules
          or conceptual graphs) have also been used depending on the domain
          characteristics.
          
               The development of a CBR system involves the collection of a
          depository of cases, deciding which features are important, and choosing an
          appropriate case representation scheme. The CBR system consists of the case
          database, and subsystems for case indexing, case retrieval, and case
          adaptation.
          
               When a new case is presented, the case retrieval subsystem searches
          the indexed database of cases for the most similar case. The adaptation
          subsystem uses the retrieved case as a model in order to create a solution for
          the new problem, to warn the user of possible failures that have occurred in
          the past, and to interpret the current situation. Two types of adaptation are
          possible. In structural adaptation the solution of the retrieved case is
          modified to fit the new case, while in derivational adaptation a new solution
          is created by using the same process as used in the retrieved case. If no case
          can be retrieved which is sufficiently similar to the current case, then the
          current case together with a solution supplied by a human expert is stored and
          indexed by the case indexing subsystem into the database of cases, thus
          enabling the system to learn.
          
               Several techniques are available for indexing a database of cases
          including conceptual clustering, inductive indexing, and nearest neighbor
          indexing. In inductive indexing an algorithm such as ID3 or Michalski's
          conceptual clustering is used to create a classification tree or a clustering
          which groups sets of similar cases together. When a new case is presented, the
          values of its attributes determine the path down the classification tree that
          should be followed or the cluster that contains the most similar cases to the
          new one. In the nearest neighbor indexing scheme, each feature of a case is
          assigned match and mismatch weights by the developer. The match weight
          indicates how important the presence of a feature is for judging two cases
          similar, while the mismatch weight indicates how strongly the absence of the
          feature from one of the compared cases counts against their similarity. The
          weights are combined by various similarity metrics to retrieve the most
          similar case.
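               A minimal sketch of such a weighted nearest-neighbor retrieval is
          given below. The additive metric, the weight values, and the disease-style
          cases are illustrative assumptions, not taken from the text:

```python
def similarity(case_a, case_b, match_w, mismatch_w):
    """Weighted similarity: a shared feature value adds its match weight,
    a differing (or absent) value subtracts its mismatch weight.
    (The additive combination is one of many possible metrics.)"""
    score = 0.0
    for f in match_w:
        if case_a.get(f) == case_b.get(f):
            score += match_w[f]
        else:
            score -= mismatch_w[f]
    return score

def retrieve(query, case_base, match_w, mismatch_w):
    """Return the stored case most similar to the query case."""
    return max(case_base, key=lambda c: similarity(query, c, match_w, mismatch_w))

# Hypothetical weights and cases for illustration.
match_w = {"eyes": 2.0, "nose": 3.0, "temperature": 1.0}
mismatch_w = {"eyes": 1.0, "nose": 1.0, "temperature": 1.0}
case_none = {"eyes": "clear", "nose": "clear", "temperature": "normal"}
case_cold = {"eyes": "bloodshot", "nose": "runny", "temperature": "high"}
query = {"eyes": "clear", "nose": "runny", "temperature": "normal"}
best = retrieve(query, [case_none, case_cold], match_w, mismatch_w)
```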
          
               Several case-based expert system shells or shells with case-based
          subsystems have become available including KwEshell, ART-IM, CBR
          Express, Esteem, ReMind (see Appendix C). Application areas ideally
          suitable for CBR system development include the legal domain, intelligent
          tutoring, and systems design. For a thorough introduction to Case-based
          reasoning, the reader is referred to the book by Kolodner, ([14]).
          
          
          6.6 EVOLUTIONARY MACHINE LEARNING
          
          The evolutionary approach to machine learning is based on the concepts of
          genetic algorithms and classifier systems.
          
               Genetic algorithms and classifier systems are optimization, search
          and machine learning techniques, suitable for a variety of problems, including
          NP-complete and state-space type problems. For a detailed introduction
          to the topic, the reader is referred to Davis, ([6]), Goldberg, ([11]) or Holland,
          ([13]).
          
          
          6.6.1 Genetic Algorithms
          
          Genetic Algorithms are based on the concepts of natural selection and
          genetics. They were introduced by Holland, ([13]), with the intention of
          designing a computer system with many of the characteristics of natural
          systems. 
          
               A genetic algorithm is a global search technique: rather than
          expanding the search for optima from a single point, it evolves a
          population of points towards the optimum solution. This feature makes
          genetic algorithms particularly suitable for parallel processing and also
          reduces the probability of getting stuck at local optima. 
          
               A genetic algorithm solves a problem as follows. An initial set of
          possible solutions is chosen at random and the solutions are encoded as finite
          strings of symbols. The encoded set of possible solutions is called a
          population. The genetic algorithm does not need extensive knowledge of the
          problem domain in order to perform an efficient search of the solution space.
          It only requires a function (the fitness function) to measure the fitness of a
          candidate solution. The closer an encoded string is to being a solution, the
          bigger its fitness value should be. The initial population is successively
          transformed to populations of greater and greater overall fitness until a
          convergence criterion is met. During evolution the population size usually
          remains constant. The basic genetic operators used to transform a population
          to a better one are: reproduction, crossover and mutation. 
          
               In reproduction, the individuals in the population are propagated to
          the next population with a probability proportional to their fitness, in such a
          way that individuals of greater fitness have a bigger chance of surviving in the
          next population. Many alternate ways have been proposed to achieve this,
          with the simplest but still quite effective being the original one proposed by
          Holland. In this approach, usually referred to as the roulette wheel approach, 
          the overall fitness of the population is calculated as the sum of the fitness of
          the individuals. The ratio of an individual's fitness to the overall fitness
          corresponds to a proportional slice of a roulette wheel. Which individuals
          make it to the next generation is determined by successive spins of the roulette
          wheel. For example, suppose the present population consists of the
          individuals 11000, 00111 and 10101 with corresponding fitness 30, 60 and
          10. Then the spinning wheel is divided into three subareas, subarea 1, 2 and
          3, corresponding to 30%, 60% and 10% of its total area. The wheel is spun
          three times to determine the three individuals in the next generation. If the
          wheel lands in area 1 then the bit string 11000 is reproduced, if it lands in area
          2 then the string 00111 is reproduced, and if it lands in area 3 then 10101 is
          reproduced.
          Clearly the string 00111 is twice as likely as string 11000 and six times as
          likely as string 10101 to reproduce.
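               The roulette wheel can be sketched as follows, using the population
          and fitness values of the example above:

```python
import random

def roulette_select(population, fitness, rng=random):
    """Fitness-proportionate (roulette wheel) selection: each individual
    owns a slice of the wheel proportional to its share of total fitness."""
    total = sum(fitness)
    spin = rng.uniform(0, total)
    cumulative = 0.0
    for individual, f in zip(population, fitness):
        cumulative += f
        if spin <= cumulative:
            return individual
    return population[-1]          # guard against floating-point round-off

# The example from the text: fitness 30, 60, 10 -> slices of 30%, 60%, 10%.
population = ["11000", "00111", "10101"]
fitness = [30, 60, 10]
next_generation = [roulette_select(population, fitness) for _ in range(3)]
```

          Over many spins, 00111 is selected about twice as often as 11000 and six
          times as often as 10101, as the text states.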
          
               Crossover introduces variety in the population of candidate solutions
          by combining two existing individuals (parents) into new individuals
          (offspring). Many forms of crossover are possible, and in its simplest form
          crossover works as follows. If k is the length of the two strings, randomly
          select a position p between 1 and k. The new strings are produced by merging
          positions 1 to p of parent 1 with positions p+1 to k of parent 2, and positions
          1 to p of the parent 2 with positions p+1 to k of parent 1. For example,
          consider the individuals 110101 and 010011. If crossover is applied at
          position 3, the resulting offspring are 110011 and 010101.
          
               Mutation introduces variation to the population by randomly altering
          individuals. This can lead the algorithm out of stagnation at local optima. One
          way to implement mutation is to randomly pick an individual and a position p
          between 1 and k, and change the value at position p to another domain value
          chosen at random. 
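               The two operators can be sketched together; the crossover call below
          reproduces the example from the text:

```python
import random

def crossover(parent1, parent2, p):
    """One-point crossover at position p (1 <= p < len): swap the tails."""
    return (parent1[:p] + parent2[p:], parent2[:p] + parent1[p:])

def mutate(individual, rng=random, alphabet="01"):
    """Change one randomly chosen position to another value of the domain."""
    p = rng.randrange(len(individual))
    choices = [c for c in alphabet if c != individual[p]]
    return individual[:p] + rng.choice(choices) + individual[p + 1:]

print(crossover("110101", "010011", 3))  # ('110011', '010101')
```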
          
               Genetic algorithms can be analyzed by looking at how schemata
          behave during the evolution of the population from generation to generation.
          A schema is a set of encoded solutions. For example, the schema 11**
          represents all strings of length four, with the first two bits equal to 1. The
          Fundamental Theorem of Genetic Algorithms shows that short, highly fit
          schemata propagate themselves through evolution in a manner of exponential
          growth and converge to a solution ([11]). 
          
               Solving a problem using genetic algorithms, involves finding a way
          to encode the candidate solutions as finite strings, finding an appropriate
          fitness function to evaluate the goodness of a candidate solution and creating
          appropriate reproduction, crossover and mutation operators for the given
          problem. This process is illustrated by the following example.
          
               Consider the boolean satisfiability problem (SAT). Given a boolean
          expression of n variables, is there an interpretation (assignment to the
          variables) that will make the formula true? This is an NP-complete problem
          (no polynomial-time algorithm is known for it; as the number of variables
          increases, the worst-case effort of known algorithms grows exponentially).
          There is an obvious way to encode
          the candidate solutions as strings. In fact they are binary strings of fixed
          length. It takes more thought to find an appropriate fitness function. The
          obvious choice of assigning a fitness of 1 to a string that satisfies the boolean
          expression and 0 to one which does not, will not work. The majority of the
          population elements will have fitness equal to 0. Many variations of fitness
          functions for the SAT have been proposed in the literature, with one of them
          being the following. To evaluate the fitness of a binary string as a candidate
          interpretation to a boolean function, use the following formulas:
          
          fitness(x) = 1 if the bit assigned to x is 1, and 0 if it is 0
          fitness(¬x) = 1 - fitness(x)
          fitness(x1 ∨ ... ∨ xn) = max(fitness(x1), ..., fitness(xn))
          fitness(x1 ∧ ... ∧ xn) = (1/n) Σi fitness(xi)
          
               For example, consider the function
          b(x1,x2,x3,x4) = (x1 ∨ x2) ∧ (x3 ∨ ¬x4) ∧ ¬x1. In order to find a
          substitution which satisfies the formula, the genetic algorithm proceeds as
          follows: 
          
               Randomly pick a population of binary strings of length 4, say: 1011,
          1001, 0010. For each string evaluate its fitness:
          
          fitness(1011) = (1/3)[max(fitness(x1),fitness(x2)) + max(fitness(x3),
               1-fitness(x4)) + (1-fitness(x1))]
                        = (1/3)[max(1,0) + max(1,1-1) + (1-1)]
                        = (1+1+0)/3 = 2/3
          fitness(1001) = 1/3
          fitness(0010) = 2/3
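               The fitness computation can be sketched with the formula b
          hardcoded, applying max over each clause and averaging the clauses:

```python
def bit_fitness(bits, i, negated=False):
    """Fitness of the literal x_i (or its negation) under interpretation bits."""
    f = float(bits[i - 1] == "1")
    return 1.0 - f if negated else f

def sat_fitness(bits):
    """Fitness of an interpretation of b = (x1 v x2) ^ (x3 v ~x4) ^ ~x1:
    each clause scores the max over its literals, and the conjunction
    averages the clause scores."""
    clauses = [
        max(bit_fitness(bits, 1), bit_fitness(bits, 2)),
        max(bit_fitness(bits, 3), bit_fitness(bits, 4, negated=True)),
        bit_fitness(bits, 1, negated=True),
    ]
    return sum(clauses) / len(clauses)

for s in ["1011", "1001", "0010"]:
    print(s, sat_fitness(s))
```

          Only a satisfying interpretation, such as 0110, reaches the maximum
          fitness of 1.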
          
               The roulette wheel will be separated into three subsections of area
          40%, 20% and 40% of the total area. During reproduction, the first and third
          strings will have double the chance to reproduce than the second string.
          Strings will randomly be chosen to be crossed over at a random position. For
          example, if string 1011 and 0010 are crossed over at position 2, they will
          produce the offspring 1010 and 0011. Mutation here is obviously necessary,
          as any eventual solution must have the second bit equal to 1, and this cannot
          happen from the initial population just by reproduction and crossover. As
          generation after generation of candidate solutions are produced, somewhere
          down the line, the random choice of which bit to mutate will be for position 2,
          introducing in the population the desired 1 in position 2. Parameters such as
          the population size and mutation rate (how often to apply mutation) affect the
          speed of convergence of the GA. 
          
          
          6.6.2 Classifier Systems
          
          Classifier systems have certain similarities with production systems as they
          are also based on match-select-act cycles. But unlike production systems,
          which typically select to fire one rule at a time, classifier systems can activate
          multiple rules in parallel. They also differ from production systems in that
          they employ competitive arbitration strategies to select which rules are to be
          fired and a reward based system for evaluating the strength of individual rules.
          The rule base of a classifier system continually evolves, through the use of
          genetic algorithms which evolve weak rules to rules of greater strength. 
          
               A classifier system is a machine learning system which consists of
          four main modules: a classifier pool, a message list, an apportionment of
          credit algorithm, and a Genetic Algorithm, ([11]).
          
               The classifier pool is a set of production rules (classifiers) written
          in the form:
          
          condition:action, strength
          
          which represents the rule:
          
          IF (condition) THEN (action): strength.
          
               A single condition or a conjunction of conditions can be present in the
          classifier. The strength assigned to a rule is indicative of its usefulness to the
          system. The condition and action part of the rule are usually encoded as strings
          of a finite symbol set, for example {1,0,#}. Both conditions and actions are
          encoded as strings of the same length k. The # symbol corresponds to the
          "don't care" element. If it is present in a condition, then it can be matched by
          either a 1 or a 0 in the corresponding position of a message.  
          
               Messages (inputs) posted by the environment are written on a
          blackboard (message list). Classifiers, whose condition part matches a
          message on the blackboard, compete and the winners activate and post their
          action as a message on the blackboard or as output of the system. The process
          is repeated, producing a chain of classifier firings. When a classifier fires and
          it has the # symbol present in its action part, the symbol, in the corresponding
          position of the message that matched its condition, is assigned to the # symbol.
          For example, if the message list contains the message 1101011, the classifier
          11#10#1:000#1## can fire and post the message 0001111. 
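               The matching and firing step can be sketched as follows, reproducing
          the example just given:

```python
def matches(condition, message):
    """A condition matches a message when every non-# position agrees."""
    return all(c in ("#", m) for c, m in zip(condition, message))

def fire(classifier, message):
    """Post the classifier's action; a # in the action part copies the
    corresponding bit of the message that matched the condition."""
    condition, action = classifier.split(":")
    if not matches(condition, message):
        return None
    return "".join(m if a == "#" else a for a, m in zip(action, message))

print(fire("11#10#1:000#1##", "1101011"))  # 0001111
```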
          
               An apportionment of credit algorithm is used to decide which
          classifiers will fire and also update the strength of classifiers in proportion to
          their usefulness. An example of such a system is the Bucket Brigade
          Algorithm. Initially, in the absence of any information about the usefulness
          of the classifiers, all of the classifiers are assigned equal strength. Classifiers,
          which are eligible to fire by matching a posted message to their conditions, bid
          an amount proportional to their strength for the right to fire. The winning
          classifiers (the classifiers posting the highest bids) fire and post their
          actions as messages on the blackboard. The strength of a firing classifier is
          reduced by the bid amount. The bid amount is divided, among the classifiers
          which posted the messages enabling the winner to fire, and is added to their
          strengths. The following difference equation determines how the strength of
          a classifier is updated in time t+1 (after a firing cycle) in terms of its strength
          in time t:
          
          S(t+1)= S(t)-B*S(t)-T*S(t)+R(t)
          
          where B is the bid rate, 0 ≤ B ≤ 1 (if the classifier did not fire, then
          B=0), R(t) is the reward the classifier is due for posting any messages
          enabling others to fire, and T is the tax rate, 0 ≤ T ≤ 1. Each classifier
          pays to the environment an amount proportional to its strength, to avoid
          predominance of the rule population by the stronger rules. Usually T is
          chosen much smaller than B, and in such a way that 0 < T+B ≤ 1. It has been
          shown that the difference equation given above is stable, in the sense that,
          in the absence of rewards, S(t) approaches 0 as t approaches infinity. 
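               The strength update is a one-line difference equation; the short loop
          below (with illustrative rate values) shows the geometric decay toward 0
          when no rewards arrive:

```python
def update_strength(s, bid_rate, tax_rate, reward):
    """S(t+1) = S(t) - B*S(t) - T*S(t) + R(t); B is 0 for a classifier
    that did not fire in this cycle."""
    return s - bid_rate * s - tax_rate * s + reward

# With no rewards, strength decays geometrically: S(t+1) = (1 - B - T) S(t).
s = 100.0
for _ in range(50):
    s = update_strength(s, bid_rate=0.1, tax_rate=0.01, reward=0.0)
```

          With a steady reward R, the strength instead settles at the fixed point
          S = R/(B+T).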
          
               The quality of the rules in the classifier pool is improved by applying
          genetic algorithms to the population of classifiers, with the classifier strength
          used as their fitness value. The genetic algorithm can be called to improve the
          classifier pool, either probabilistically, with an average number of time steps
          between calls, or deterministically, with a fixed number of time steps between
          calls.
          
          
          6.6.3 Deriving or improving rule sets with genetic algorithms, BEAGLE
          
          One of the first systems, to use genetic algorithms for deriving rules from data,
          was the BEAGLE system (Biologic Evolutionary Algorithm Generating
          Logical Expressions, Forsyth, [8]). Given a set of training examples, of the
          form (attribute1=value1,..., attributen=valuen), where attributen is the goal
          attribute, we wish to derive rules specifying under what conditions on the
          other attributes, the goal attribute will assume a given value.
          
               For example, consider the company data of Section 6.2.1. Assume
          that the possible domain values for the attributes are:
          
          for company age:  1 (very old), 2 (old), 3 (midlife), 4 (new), 5 (very new)
          for product type:  1 (software), 2 (hardware).
          for competition:  1 (if there is competition), 0 (if there is no competition).
          for profit:  1(if profit is up), 2 (if profit is down).
          
          The objective is to derive rules of the form:
          
          IF conditions THEN profit=1.
          
               The Beagle algorithm randomly picks a population of rule conditions.
          For example:
          
          condition 1: (age<=3) 
          condition 2: (product type<2) ∧ (competition=1).
          
          The first condition corresponds to the rule:
          
          IF company is very old or old or midlife, then the profit is up.
          
          The second condition corresponds to the rule:
          
          IF product is software and there is competition, then the profit is up.
           
               Possible positions for crossover are the relational operators (<, >,
          etc.) and logical operators (¬, ∧, ∨). In the conditions above, if the positions
          of <= and < are randomly chosen for a crossover, the offspring conditions after
          crossover will be (age<=2) ∧ (competition=1) and (product type<3). These
          conditions correspond to the rules:
          
          IF company is old or very old and it faces competition then profit is up and 
          IF product is software or hardware THEN profit is up.
          
               If, on the other hand, the relational operator <= and the logical
          operator ∧ were picked as the crossover points, then the following two
          conditions would be produced after crossover: 
          
          age<=(competition=1)
          and 
          (product type<2) ∧ 3, which reduces to:  product type<2.
          
          Semantically, these conditions correspond to the rules:
          
          If the age is very old and competition is present then profit is up
          and
          If product type is software then profit is up. 
          
               The genetic algorithm of Beagle produces the next population of rules
          from the present population as follows.
          
               Each rule is evaluated on all the training examples and its fitness
          calculated in terms of the number of examples correctly classified by the rule.
          The rules are ranked in descending order of fitness and the weaker bottom
          half is removed. Pairs of surviving rules are randomly crossed over, as
          described above, and the offspring are inserted in the population. Finally, a
          few rules may be mutated at random by changing an attribute or its value in
          the rule. The process is repeated, producing generations of rules, until a
          predetermined number of iterations is reached or rules of satisfactory
          classification accuracy are obtained.
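               The generational scheme can be sketched independently of the rule
          representation, with the fitness, crossover, and mutation operators passed
          in as functions (the bit-string demonstration at the bottom is an
          illustrative stand-in for real BEAGLE rule conditions):

```python
import random

def next_generation(rules, fitness_fn, crossover_fn, mutate_fn,
                    mutation_rate=0.05, rng=random):
    """One BEAGLE-style generation: rank the rules by fitness, drop the
    weaker half, refill by crossing over random pairs of survivors, and
    occasionally mutate an offspring."""
    ranked = sorted(rules, key=fitness_fn, reverse=True)
    survivors = ranked[: len(ranked) // 2]
    offspring = []
    while len(survivors) + len(offspring) < len(rules):
        a, b = rng.sample(survivors, 2)
        child, _ = crossover_fn(a, b)
        if rng.random() < mutation_rate:
            child = mutate_fn(child)
        offspring.append(child)
    return survivors + offspring

# Toy demonstration: bit-string "rules" whose fitness is the count of 1s,
# crossed over at the midpoint, with mutation disabled.
def _fit(r): return r.count("1")
def _cross(a, b):
    p = len(a) // 2
    return a[:p] + b[p:], b[:p] + a[p:]
new_rules = next_generation(["1111", "1000", "1100", "0000"], _fit, _cross,
                            mutate_fn=lambda r: r, mutation_rate=0.0,
                            rng=random.Random(0))
```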
          
               Beagle has successfully been used to produce the knowledge base for
          a number of different applications, including a system to classify glass
          fragment evidence in forensic science (Evett and Spiehler, [7]). 
          
          
          6.6.4 Deriving a knowledge base for insurance underwriting using
          genetic algorithms
          
          Below, we give another example of how genetic algorithms were used to
          automatically derive a knowledge base for insurance underwriting
          (Nikolopoulos, Duvendack, [22]). The machine learning component of the
          described expert system extracts underwriter rules from a collection of policy
          holders' records, in order to determine when a policy should be cancelled.
          
               Policy holders' records are given in the form of attribute-value pairs.
          One of the goal attributes is the "terminate_policy" attribute which assumes
          one of the values:
          terminated_by_company, terminated_by_policy_holder and active_policy.
          Each data record had 19 attributes. The training set was noisy, mainly because
          of missing attribute values. The data set was separated into a training set and
          a testing set, containing 70% and 30%, respectively, of the available data
          patterns.
          
               Each individual of the genetic population consists of a sequence of
          rules and represents a candidate knowledge base for the problem domain. The
          19 attributes were labeled from 1 to 19 and each rule in an individual is of the
          form: a1,a2,...,a19,g where ai belongs to the domain of attribute i or is equal
          to *, and g is a value of the goal attribute. The symbol * is the
          don't care symbol, and was introduced in order to keep the individual's
          length constant, while at the same time permitting rules of varying
          effective lengths. The presence of * in position i of the rule chromosome indicates that
          attribute i is not present as a premise in the rule. For example, the encoding
          2*0**...*3 corresponds to the rule:
          
          IF (number_of_claims=2 and underwriter_code=no-violations) THEN
          policy=active,
          
          since number_of_claims was the first attribute in the attribute labeling, and
          underwriter_code was the third. The value 3 was used to encode the value
          "active" for the goal attribute policy.
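               The encoding can be illustrated with a short decoder and matcher. The
          partial attribute-name and goal-value tables below are assumptions standing
          in for the actual 19-attribute schema:

```python
# Illustrative, partial lookup tables (the real system had 19 attributes
# whose full schema is not given in the text).
ATTRIBUTE_NAMES = {1: "number_of_claims", 3: "underwriter_code"}
GOAL_VALUES = {"3": "active"}

def decode(chromosome):
    """Turn a 20-slot chromosome (19 attribute slots plus a goal value)
    into (premises, conclusion); a * slot means the attribute is absent
    from the rule's premise."""
    *slots, goal = chromosome
    premises = [(ATTRIBUTE_NAMES.get(i + 1, "attr_%d" % (i + 1)), value)
                for i, value in enumerate(slots) if value != "*"]
    return premises, GOAL_VALUES.get(goal, goal)

def matches(chromosome, record):
    """A 19-value record matches a rule if it agrees with the rule
    on every non-don't-care slot."""
    return all(slot == "*" or slot == value
               for slot, value in zip(chromosome[:-1], record))
```

          Decoding the chromosome 2*0**...*3 from the example above yields the two
          premises number_of_claims=2 and underwriter_code=0 with conclusion "active".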
          
               The given encoding enables all rules to be of constant length. The
          number of rules in an individual is also held constant and needs to be supplied
          to the algorithm. When a random initial population of individuals is created,
          as well as during mutation, the don't care symbol, *, is given a higher
          probability of selection, in order to ensure rules of varying lengths and
          candidate knowledge bases of varying sizes. The presence of the don't care
          symbol also makes the rules more general.
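               The biased initialization might be sketched as follows; the 0.7
          probability assigned to * is an assumed figure, as the actual bias used is not
          reported:

```python
import random

def random_rule(domains, goal_values, p_dont_care=0.7):
    """Draw one rule chromosome; domains[i] lists the legal codes for
    attribute i+1. The bias toward * yields rules of varying effective
    length (0.7 is an assumed value)."""
    slots = ["*" if random.random() < p_dont_care else random.choice(domain)
             for domain in domains]
    return slots + [random.choice(goal_values)]

def random_individual(n_rules, domains, goal_values):
    """An individual (candidate knowledge base) is a fixed-size
    sequence of rule chromosomes."""
    return [random_rule(domains, goal_values) for _ in range(n_rules)]
```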
          
               The fitness of an individual rule can be defined as the percentage of
          instances correctly classified by the rule out of the total number of instances
          which match the rule's premises. The fitness of a rule indicates the
          effectiveness of the rule as a discriminator, and the strength of an individual
          is the sum of the fitnesses of the rules in that individual. The fitness of an
          individual is the ratio of its strength to the sum of the strengths of all
          individuals in the population.
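               The fitness scheme can be sketched as follows, assuming a rule is
          represented as a (match-predicate, predicted-goal-value) pair; all names are
          illustrative:

```python
def rule_fitness(rule, records):
    """Fraction of matching records that the rule classifies correctly;
    0.0 if no record matches the rule's premises."""
    matches_fn, predicted = rule
    matched = [r for r in records if matches_fn(r)]
    if not matched:
        return 0.0
    return sum(1 for r in matched if r["goal"] == predicted) / len(matched)

def strength(individual, records):
    """Strength of an individual = sum of its rules' fitnesses."""
    return sum(rule_fitness(rule, records) for rule in individual)

def normalized_fitness(individual, population, records):
    """Fitness of an individual = its strength divided by the total
    strength of the population."""
    total = sum(strength(ind, records) for ind in population)
    return strength(individual, records) / total if total else 0.0
```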
          
               Convergence of the genetic algorithm after 20 generations produced
          a rule set which had a classification accuracy of 75 percent. This was quite
          good and comparable to the performance of human underwriters. But when a
          domain expert was given the rules for critique, it was observed that many of
          the rules did not make sense semantically, and some attributes which were not
          supposed to influence the decision to terminate a policy appeared
          frequently in rule premises (for example, the attributes type_of_coverage,
          family_or_individual etc.). These attributes were eliminated from
          consideration, and the genetic process was run again; the best rule set then
          produced had a classification accuracy of 61 percent, but was more
          agreeable and in line with the reasoning of the human expert underwriters.
          
               It is also possible to apply genetic algorithms in order to improve on
          a rule set derived by another inductive learning technique. As an example the
          reader is referred to [22] where a hybrid machine learning technique was used
          for generating the knowledge base. The hybrid approach outperformed both
          of the other approaches used individually, and is further discussed in Chapter
          8.