How to build weka classifiers using the ECJ library?

A short version is available here.

If you have any questions, please do not hesitate to ask. You can contact me at romaric DOT pighetti AT gmail DOT com.

Goal and introduction

The goal of this tutorial is to allow people to build classifiers, clusterers or other tools for weka using evolutionary algorithms coded and parametrized with the ECJ library. In order to do that, a few classes will be given to you in this tutorial along with explanations about their content and how to use them. At the end of this tutorial, an archive is available for download with the sources of these classes along with a jar file containing the compiled classes.

Prerequisites

Before beginning this tutorial you need:

  • A good knowledge of Java.
  • ECJ sources available at http://cs.gmu.edu/~eclab/projects/ecj/.
  • Weka available at http://www.cs.waikato.ac.nz/ml/weka/. You'll need access to the jar file or the sources to be able to compile your clusterer/classifier/association algorithm. I recommend you download the Linux version, which is not a self-extracting executable archive. It is a simple archive, and when extracting it you'll find two jar files: one containing the compiled classes and one containing the source code if you want to explore it. I use weka 3.6.6 here.
  • If you want to explore ECJ's sources in depth and recompile the whole thing, then you'll probably want the dependencies needed to compile the GUI of ECJ, namely JFreeChart, iText and jcommon, all enclosed in the jfreechart download available at http://sourceforge.net/projects/jfreechart/files/. You can take only iText-2.1.5.jar, jcommon-1.0.17.jar and jfreechart-1.0.14.jar from the lib folder in the archive. The other files are not needed to compile ECJ.
  • Knowledge of ECJ. Tutorials can be found at http://cs.gmu.edu/~eclab/projects/ecj/docs/.

Class diagram:

Here is a short diagram representing what we will add to ECJ to be able to use it correctly with weka, and how it is connected to the core of ECJ.

[Figure: UML class diagram]

Getting the results of the evolutionary algorithm

ECJ is based on parameter files which contain the parameters of the evolutionary algorithms, including the names of the classes used to perform the different parts of the algorithm. The only way to access the results of the computation in a proper manner is through the statistics process. Thus I created the class ec.weka.StatisticsForWeka, whose role is to keep the best individuals computed for each subpopulation in each job. This class is also responsible for saving this information and restoring it upon checkpointing.

ec.weka.StatisticsForWeka:

package ec.weka;
 
import ec.EvolutionState;
import ec.Individual;
 
public class StatisticsForWeka extends ec.Statistics {
 
    private static final long    serialVersionUID = 8272467899443777505L;
 
    public static Individual[][] bestOfJobs;
 
    public static boolean        staticBestInit   = false;
 
    /**
     * For checkpointing usage.
     */
    private Individual[][]       bestOfJobsCopy;
 
    /** The best individual we've found so far */
    private Individual[]         best_of_run;
 
    public Individual[] getBestSoFar() {
        return best_of_run;
    }
 
    public void postInitializationStatistics(final EvolutionState state) {
        super.postInitializationStatistics(state);
 
        // set up our best_of_run array -- can't do this in setup, because
        // we don't know if the number of subpopulations has been determined yet
        best_of_run = new Individual[state.population.subpops.length];
    }
 
    /** Tracks the best individual of each subpopulation. */
    public void postEvaluationStatistics(final EvolutionState state) {
        super.postEvaluationStatistics(state);
 
        // find the best individual of each subpopulation in this generation
        Individual[] best_i = new Individual[state.population.subpops.length];
        for (int x = 0; x < state.population.subpops.length; x++) {
            best_i[x] = state.population.subpops[x].individuals[0];
            for (int y = 1; y < state.population.subpops[x].individuals.length; y++)
                if (state.population.subpops[x].individuals[y].fitness
                        .betterThan(best_i[x].fitness))
                    best_i[x] = state.population.subpops[x].individuals[y];
 
            // now test to see if it's the new best_of_run
            if (best_of_run[x] == null
                    || best_i[x].fitness.betterThan(best_of_run[x].fitness))
                best_of_run[x] = (Individual) (best_i[x].clone());
        }
    }
 
    public void preCheckpointStatistics(final EvolutionState state) {
        this.bestOfJobsCopy = bestOfJobs;
    }
 
    public void putBestJobsFromCopy() {
        if(!staticBestInit) {
            bestOfJobs = this.bestOfJobsCopy;
            staticBestInit = true;
        }
    }
}

This class is used to get the results of the computation and give them to weka (via the custom Evolve class, which is studied later in this tutorial). So if you want your algorithm to be able to communicate its results to weka, you'll need to make your statistics class derive from this one.

You probably noticed that there are two variables storing the best individuals of the jobs. This is normal and necessary. Indeed, the static field is not stored upon serialization, so the information it holds is lost when restoring from a checkpoint. On the other hand, the private field is renewed between two jobs because a new statistics object is built for each job, so its information is lost from one job to another. That's why we need both of them.

The private field is updated before each checkpoint and allows us to restore data upon checkpoint loading; the static field is always up to date and allows us to keep track of the best individuals from one run to another.

We couldn't put this information in the Evolve class since it is not part of the serialized data when checkpointing. That's why I chose to put it in the statistics class.
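The need for both fields can be checked with plain Java serialization. The following is only a minimal sketch: CheckpointDemo is a hypothetical stand-in for StatisticsForWeka, strings stand in for individuals, and the staticBestInit guard is left out for brevity. It does not use ECJ at all.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for StatisticsForWeka; not part of ECJ. */
final class CheckpointDemo implements Serializable {

    private static final long serialVersionUID = 1L;

    /** Shared by all jobs, but never written by Java serialization. */
    public static String[] bestOfJobs;

    /** Per-object copy; this is what ends up in the serialized data. */
    private String[] bestOfJobsCopy;

    /** Plays the role of preCheckpointStatistics. */
    public void preCheckpoint() {
        bestOfJobsCopy = bestOfJobs;
    }

    /** Plays the role of putBestJobsFromCopy (without the init flag). */
    public void restoreStatic() {
        bestOfJobs = bestOfJobsCopy;
    }

    /** Serializes the object to memory and reads it back, like a checkpoint. */
    public static CheckpointDemo roundTrip(CheckpointDemo stats) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(stats);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (CheckpointDemo) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        bestOfJobs = new String[] { "best of job 0" };
        CheckpointDemo stats = new CheckpointDemo();
        stats.preCheckpoint();

        CheckpointDemo restored = roundTrip(stats);
        bestOfJobs = null; // simulate a fresh JVM, where statics start empty

        System.out.println("static lost: " + (bestOfJobs == null));
        restored.restoreStatic();
        System.out.println("recovered: " + bestOfJobs[0]);
    }
}
```

The static array is not part of the byte stream, so only the instance copy survives the round trip, which is the behaviour the two fields are designed around.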

Running the algorithm and getting the results

The standard way to run an ECJ algorithm is to use the ec.Evolve class from the command line, passing a parameter file or a checkpoint file as an argument. The problem is that no results are accessible from the program after the main method exits.

In order to be able to get the results, I created the ec.weka.Evolve class. This is a copy of the ec.Evolve class with some changes. The run method of the ec.weka.Evolve class is a modified copy of the main method of the ec.Evolve class. It returns the best individuals of each subpopulation in each job, getting them from the statistics of the EvolutionState. To get the results, the statistics field is cast to ec.weka.StatisticsForWeka after verifying that it is an instance of that class or of one of its subclasses. If it is not, an empty array is returned.

A main method is also implemented to allow users to run their algorithms from the command line. However, this feature is mainly meant for tests. The main method only calls the run method and then exits.

The changes introduced are mainly initializations of the variables used to store the results in different cases.

Since the class is quite long, I won't print it here, but you can find it in src/ec/weka/Evolve.java.

The only changes are in the run and main methods, plus the addition of a new field of type Instance[] whose usage is explained in the next section. You can skip the rest of the file.

Getting data from Weka

Weka allows its users to compare the results of different data-mining algorithms running under the same conditions. It is made to perform classification, clustering and association. In addition, a lot of well-known algorithms are already implemented in it, which makes it easier to compare your algorithm with existing techniques.

Evolutionary computation can be used in different ways in this context. Indeed, evolutionary algorithms are able to compute decision trees, clusters, association rules and a lot of other things. That's why I'll try to keep this tutorial as general as possible. However, in order to make it clear, I use the example of a clusterer in the following. Even if this clusterer doesn't compute anything, it allows me to show you the lines you'll need to write to make ECJ and Weka communicate.

Weka has three main categories of algorithms:

  • Classifiers
  • Clusterers
  • Associators

Weka uses dynamic class loading to identify all the available algorithms; however, you'll need to inherit from a given base class or implement the appropriate interface if you want your algorithm to be recognized by weka. I won't go into much detail for each type of algorithm here. There is an article on weka's wiki explaining how to write your own classifier: http://weka.wikispaces.com/Writing+your+own+Classifier+%28post+3.5.2%29#Coding-Base-classifier, and writing your own associator shouldn't be much more difficult.

Some algorithms need to learn from a set of data before being actually run, others classify instances and so they need an instance on which they can work etc. All these data are given by weka, enclosed in Instances, Instance and Attribute objects (we'll see how to use them in the following chapters).

If we want to access this data in an evolutionary algorithm using ECJ, we have to find a way to put it in a class accessible from the components of the evolutionary algorithm. That's why I put a private field of type Instance[] in ec.weka.Evolve (I'm not using Instances because it is hard to create and would make it harder to use ECJ algorithms for some tasks). This field contains the data we want to communicate from weka to ECJ, if any; otherwise it is null. If needed, it must be set before calling the run method of ec.weka.Evolve. Then, before beginning each job, it is automatically assigned to a field of the EvolutionState created for the job.

We need to do so if we want the algorithm to have access to this data. Indeed, the algorithm has access to its state but not to the Evolve class that launches it. In addition, if we want the data to be serialized when checkpointing, we have to store it in the EvolutionState or somewhere in the main part of the algorithm, because only information about the state of the algorithm is saved when checkpointing.

A new instance of the EvolutionState is created for each job. That's why we can't store the given data only in the state.

As you may have guessed, I created a special EvolutionState for this purpose. This state extends ec.EvolutionState and adds a new field with a getter and a setter. Here is the complete class:

ec.weka.EvolutionStateForWeka:

package ec.weka;
 
import ec.EvolutionState;
 
public class EvolutionStateForWeka extends EvolutionState {
 
    private static final long serialVersionUID = 3779199959740620263L;
 
    /**
     * Data handed over by weka (for example the training instances).
     * Set by Evolve before each job so that the ECJ algorithm can use it.
     * Being part of the state, it is saved when checkpointing.
     */
    protected weka.core.Instance[] learningDataSet;
 
    public void setLearningDataSet(weka.core.Instance[] lds) {
        learningDataSet = lds;
    }
 
    public weka.core.Instance[] getLearningDataSet() {
        return learningDataSet;
    }
 
}

I added the same field with a setter in the ec.weka.Evolve class. Now each algorithm using an EvolutionState that extends this one is capable of receiving data from weka.

At this stage, compilation becomes a bit more complicated. You need to add the folder containing the compiled classes of weka, or a weka jar, to the classpath in order to compile this last class.

One of the problems now is: when do I have to transfer data from weka to ECJ, and when do I run my evolutionary algorithm?

This depends on what you want to do! But here are some clues:

  • The training data (or the data in which we are trying to find associations) is given by weka in the buildClusterer method (buildClassifier or buildAssociations depending on what you're doing), enclosed in an Instances object.
  • For classifiers and clusterers, each new Instance to be clustered or classified is given in the classifyInstance or clusterInstance method enclosed in an Instance object.

Using Instances and accessing the information enclosed within it

The Instances class gives you access to a set of Instance objects. Each instance has a set of attributes which contain values.

In weka there are 5 types of attribute:

  • NOMINAL: This type of attribute has a set of possible values. Each instance has one of these values as its value for this kind of attribute.
  • DATE: This type of attribute stores a date.
  • NUMERIC: This kind of attribute stores a double value.
  • STRING: This kind of attribute stores a string.
  • RELATIONAL: This is a special type of attribute used for multi-instance classification. See weka's documentation for more details.

Weka stores every value as a double. These doubles are stored in the Instance and can be retrieved from it. Numeric values need no interpretation, but the others do. The Attribute class exposes some methods to retrieve the original value from the double value if necessary. It also exposes methods to find out the type of the attribute:

Method signature | Description
int type() | Returns an int representing the type of the attribute. Can be tested against Attribute.NOMINAL, Attribute.NUMERIC, Attribute.DATE, Attribute.STRING and Attribute.RELATIONAL.
String formatDate(double date) | For a date attribute, formats the date as a String following the date format specified in the attribute object (see weka documentation for more details).
int numValues() | Number of values of the attribute. Only for nominal, string and relational attributes.
String value(int valIndex) | Gets the string value at the given index. For nominal and string attributes. The valIndex is the double value of the given attribute for an instance, cast to an int.

These are the most important methods for the first usage. Other methods are available to get the lower and upper bounds for numeric values or other things.
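To make the decoding of a nominal value concrete, here is a small sketch in plain Java. NominalDemo is a hypothetical stand-in that only mimics numValues() and value(int) from the table above; it is not weka code, and the labels are made up. With the real classes, the same decoding would be written as instance.attribute(i).value((int) instance.value(i)).

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a nominal weka Attribute; not weka code. */
final class NominalDemo {

    /** Declared labels of the attribute, in declaration order. */
    private final List<String> labels;

    public NominalDemo(String... labels) {
        this.labels = Arrays.asList(labels);
    }

    /** Plays the role of Attribute.numValues(). */
    public int numValues() {
        return labels.size();
    }

    /** Plays the role of Attribute.value(int): index to original label. */
    public String value(int valIndex) {
        return labels.get(valIndex);
    }

    /** Decodes the double kept in an instance back to its label. */
    public String decode(double stored) {
        return value((int) stored);
    }

    public static void main(String[] args) {
        NominalDemo outlook = new NominalDemo("sunny", "overcast", "rainy");
        double stored = 2.0; // what Instance.value(attIndex) would return
        System.out.println(outlook.decode(stored));
    }
}
```

The double is nothing more than the index of the label in the attribute's list of values, which is why it must be cast to an int before being used.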

Now I will present some useful methods of Instances and Instance that allow you to retrieve the information contained in these classes:

Method signature | Class | Description
int numInstances() | Instances | Returns the number of instances enclosed in this Instances.
int numAttributes() | Instances | Returns the number of attributes in this Instances. Some Instance objects can have missing values; you should filter the data for missing values before passing it to the ECJ algorithm, or specify in the weka class that you don't handle data with missing values. Otherwise you'll have to handle this in the ECJ algorithm, which I think is a bad idea. You'll see how to do that later in this tutorial.
Instance instance(int index) | Instances | Returns the Instance present at the specified index.
int classIndex() | Instances | Returns the index of the class attribute. Attributes are retrieved by index; this is useful mainly for the classification learning process. Indeed, the class attribute gives the class to which an Instance belongs.
int numAttributes() | Instance | Returns the number of attributes of this Instance.
int classIndex() | Instance | Returns the index of the class attribute. Attributes are retrieved by index; this is useful mainly for the classification learning process.
double value(int attIndex) | Instance | Returns the value of the Attribute at the specified index as a double. This is the value to be used in the Attribute methods explained above.
Attribute attribute(int index) | Instance | Retrieves the Attribute stored at the given index. Use the same index as for the value method explained above to get the corresponding Attribute and retrieve the original string, nominal or date value if necessary.

This should be all you need at first to browse the content of an Instances object and retrieve the information it contains. More methods are available; if you want to go further, read weka's documentation together with the source code.

Building a new algorithm in weka

Weka 3.6.6 comes with a feature that looks for Classifier and Clusterer algorithms dynamically. This reduces the effort needed when creating a new algorithm. I will explain the design of a clusterer. However, it shouldn't be very different for a classifier or an associator.

Weka's clusterers must implement the Clusterer interface, or extend the AbstractClusterer class (which is a bit simpler, I think), or extend any class that implements the Clusterer interface. The clusterer must also be in the classpath when launching weka, otherwise it won't appear anywhere. I'll extend AbstractClusterer in this tutorial. Let's see what extending AbstractClusterer involves.

When extending AbstractClusterer you must implement the buildClusterer method. This method performs the learning from the data given to it. If no learning has to be performed, then the data should be empty or you can ignore it.

You must also implement the numberOfClusters method, which returns an int giving the number of clusters created.

Once these two methods are implemented, there is at least one step left. You must implement either the clusterInstance method or the distributionForInstance method. These are the methods which perform the clustering of new instances. If your clusterer derives from AbstractClusterer and implements only one of these two methods, the other is computed from the results of the one you implemented. See the code of AbstractClusterer for details, it's simple.

At this stage we have this:

EcjBasedClusterer.java, first step:

package weka.clusterers;
 
import weka.core.Instance;
import weka.core.Instances;
 
public class EcjBasedClusterer extends AbstractClusterer {
 
    private static final long serialVersionUID = -5129098585519877366L;
 
    private final int numCluster = 1;
 
    /**
     * Generates a clusterer. Has to initialize all fields of the clusterer
     * that are not being set via options.
     *
     * @param data set of instances serving as training data 
     * @exception Exception if the clusterer has not been 
     * generated successfully
     */
    @Override
    public void buildClusterer(Instances data) throws Exception {
 
    }
 
    /**
     * Returns the number of clusters.
     *
     * @return the number of clusters generated for a training dataset.
     * @exception Exception if number of clusters could not be returned
     * successfully
     */
    @Override
    public int numberOfClusters() throws Exception {
        return numCluster;
    }
 
    /**
     * Classifies a given instance. Either this or distributionForInstance()
     * needs to be implemented by subclasses.
     *
     * @param instance the instance to be assigned to a cluster
     * @return the number of the assigned cluster as an integer
     * @exception Exception if instance could not be clustered
     * successfully
     */
    public int clusterInstance(Instance instance) throws Exception {
        return 0;
    }
 
}

This clusterer puts all the data it gets in the first cluster, of index 0. The number of clusters is 1, which is the worst clustering we can imagine. The buildClusterer method doesn't do anything.
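The way clusterInstance and distributionForInstance can be derived from each other can be sketched without weka. The classes below are a simplified paraphrase of that idea, not the actual AbstractClusterer code: ClustererSketch and SignClusterer are hypothetical names, and an instance is reduced to a plain double[].

```java
import java.util.Arrays;

/** Simplified illustration of the two mutually derived defaults; not weka code. */
final class ClusterDefaultsDemo {

    public abstract static class ClustererSketch {

        public abstract int numberOfClusters();

        /** Default: pick the cluster with the highest membership value. */
        public int clusterInstance(double[] instance) {
            double[] dist = distributionForInstance(instance);
            int best = 0;
            for (int i = 1; i < dist.length; i++) {
                if (dist[i] > dist[best]) {
                    best = i;
                }
            }
            return best;
        }

        /** Default: put all the weight on the cluster chosen by clusterInstance. */
        public double[] distributionForInstance(double[] instance) {
            double[] dist = new double[numberOfClusters()];
            dist[clusterInstance(instance)] = 1.0;
            return dist;
        }
    }

    /** Overrides only clusterInstance: negative first value gives 0, otherwise 1. */
    public static final class SignClusterer extends ClustererSketch {
        @Override
        public int numberOfClusters() {
            return 2;
        }

        @Override
        public int clusterInstance(double[] instance) {
            return instance[0] < 0 ? 0 : 1;
        }
    }

    public static void main(String[] args) {
        double[] dist = new SignClusterer().distributionForInstance(new double[] { 3.5 });
        System.out.println(Arrays.toString(dist));
    }
}
```

In this sketch, overriding neither method would make the two defaults call each other forever, which is why at least one of them has to be implemented.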

Handling options in a weka clusterer

Now that we have the structure to build a new clusterer, it would be useful to be able to handle options for it. In our particular case, being able to give the parameters file of the ECJ algorithm as an option makes it easier to test different algorithms without changing the clusterer class.

In this context we will add a String private field to our clusterer containing the path to this file:

Path to the ECJ parameters file:

private String parameterFilePath;

Then, to get option handling in our clusterer, it needs to implement the OptionHandler interface. This consists of only 3 methods:

  • Enumeration listOptions(), which returns an enumeration of Option objects.
  • void setOptions(String[] options) throws Exception, which sets the options from an array of strings. As we will see, some helpers are available in the weka.core.Utils class to parse the array.
  • String[] getOptions(), which returns an array of strings containing the different parts of the options. If given on a command line, the elements of the array would have been separated by space characters.

These 3 methods in our case:

/**
 * Returns an enumeration describing the available options.
 * 
 * @return an enumeration of all the available options.
 */
public Enumeration listOptions() {
 
    Vector newVector = new Vector(1);
 
    newVector
            .addElement(new Option(
                    "\tSpecify the path of the properties file to load ecj parameters\n",
                    "F", 1, "-F "));
    return newVector.elements();
}
 
/**
 * Parses a given list of options. Valid options are:
 * 
 * -F "FilePath"
 * Set the path to the ecj parameters file.
 * 
 * @param options
 *            the list of options as an array of strings
 * @exception Exception
 *                if an option is not supported
 */
public void setOptions(String[] options) throws Exception {
 
    String optionString = Utils.getOption('F', options);
    if (optionString.length() != 0) {
        parameterFilePath = optionString;
    }
}
 
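/**
 * Gets the current settings of the clusterer.
 * 
 * @return an array of strings suitable for passing to setOptions
 */
public String[] getOptions() {
    // nothing to report while the path has not been set;
    // this avoids returning an array containing null
    if (parameterFilePath == null) {
        return new String[0];
    }
    String[] options = new String[2];
    options[0] = "-F";
    options[1] = parameterFilePath;
    return options;
}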

Here the -F option allows the user to specify the ECJ parameter file. We can add a -C option to specify a checkpoint file if needed. I won't do that here, but you can do it yourself; it's pretty simple now that you have the structure.

In order to have option handling fully operational in the GUI, you need to do a little bit more. Indeed, the GUI uses the getters and setters of the option fields to get and set them. So you must implement them if you want the GUI to work correctly. Their names must begin with get and set respectively, which means:

setter

public void setBlabla(TypeOfTheField t)
{
    /* your code here */
}

getter

public TypeOfTheField getBlabla()
{
    /* your code here, must return a value of type TypeOfTheField */
    return nameOfTheField;
}

You can also provide a tip text for your options. To do that, just create a method with the signature: public String yourFieldTipText().

See weka's documentation for a full explanation on how to perform option handling and make it available in the GUI.
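The naming rule can be checked with the standard JavaBeans introspection of the JDK. This is only a sketch, under the assumption that the GUI discovers options through this standard convention, which is what the get/set rule above suggests. BeanNamingDemo and Stub are hypothetical names, and the stub does not extend any weka class.

```java
import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical demo of the get/set naming rule; not weka code. */
final class BeanNamingDemo {

    /** Stub exposing one option through a getter and a setter. */
    public static class Stub {
        private String parameterFilePath = "";

        public String getParameterFilePath() {
            return parameterFilePath;
        }

        public void setParameterFilePath(String path) {
            parameterFilePath = path;
        }

        /** Tip text method: the property name followed by "TipText". */
        public String parameterFilePathTipText() {
            return "Path of the ECJ parameters file.";
        }
    }

    /** Lists the properties of the type that have both a getter and a setter. */
    public static List<String> editableProperties(Class<?> type) {
        List<String> names = new ArrayList<>();
        try {
            BeanInfo info = Introspector.getBeanInfo(type, Object.class);
            for (PropertyDescriptor p : info.getPropertyDescriptors()) {
                if (p.getReadMethod() != null && p.getWriteMethod() != null) {
                    names.add(p.getName());
                }
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(editableProperties(Stub.class));
    }
}
```

The property name is derived from the method names, so getParameterFilePath and setParameterFilePath together define a property called parameterFilePath, and the matching tip text method is parameterFilePathTipText.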

Now your options are fully available. I built a complete example in weka.clusterers.EcjBasedClusterer.

Calling the ECJ algorithm:

The ec.weka.Evolve class has only static fields and methods. If we're launching the algorithm from scratch (i.e. not restoring from a checkpoint), the first thing to do before running any algorithm is to set the learningDataSet (if needed). The name can be misleading, but it can be used to store any Instance objects that we want to use in the ECJ algorithm. It is done with this method:

setLearningDataSet:

Instance[] instances = new Instance[numberOfInstance];
/*
 * Fill instances with the wanted Instance objects.
 */
ec.weka.Evolve.setLearningDataSet(instances);

Then you're ready to run the algorithm. Just call the run method, giving it the array of String containing the options you would have typed on the command line. For example, if I'm running an algorithm from scratch:

calling the run method with a parameter file:

String[] parameters = new String[3];
parameters[0] = ""; // unused first slot, kept as an empty string
parameters[1] = "-file";
parameters[2] = parameterFilePath;
Individual[][] results = ec.weka.Evolve.run(parameters);

There you are, the algorithm runs with the parameters file you specified in weka options for your clusterer and you got the results back at the end of the computation.

The parameters array given to the run method must contain the same arguments as if you were calling Evolve from the command line. Note that in Java, unlike in C, the argument array does not start with the name of the command typed. Here the first slot is not used because it is never read, but I prefer to keep one empty string at the beginning of the array and specify the options in the rest of the array.

If you want to load from a checkpoint, you don't need to set the learning data. I mentioned that the data given by weka to ECJ is stored in a way that allows it to be saved when checkpointing. So when resuming, the previously saved data is restored and used to complete the evolution process. If new data is given, it is ignored. Here is a bit of code showing how to launch an algorithm from a checkpoint file:

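String[] parameters = new String[3];
parameters[0] = ""; // unused first slot, kept as an empty string
parameters[1] = "-checkpoint";
// here the path must point to the checkpoint file, not to a parameter file
parameters[2] = parameterFilePath;
Individual[][] results = ec.weka.Evolve.run(parameters);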
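The two ways of building the argument array can be gathered in a small helper. This is a sketch only: EvolveArgs and its method names are hypothetical and belong neither to ECJ nor to the classes given in this tutorial. The rule that a checkpoint file takes precedence over a parameter file follows the behaviour described for the example clusterer.

```java
import java.util.Arrays;

/** Hypothetical helper building the arrays passed to ec.weka.Evolve.run. */
final class EvolveArgs {

    /** Arguments for a run from scratch with the given parameter file. */
    public static String[] fromScratch(String parameterFilePath) {
        return new String[] { "", "-file", parameterFilePath };
    }

    /** Arguments for resuming from the given checkpoint file. */
    public static String[] fromCheckpoint(String checkpointFilePath) {
        return new String[] { "", "-checkpoint", checkpointFilePath };
    }

    /** The checkpoint file, when given, overrides the parameter file. */
    public static String[] choose(String parameterFilePath, String checkpointFilePath) {
        if (checkpointFilePath != null && !checkpointFilePath.isEmpty()) {
            return fromCheckpoint(checkpointFilePath);
        }
        return fromScratch(parameterFilePath);
    }

    public static void main(String[] args) {
        // "my.params" is a made-up file name used only for the demonstration
        System.out.println(Arrays.toString(choose("my.params", null)));
    }
}
```

The resulting array would then be handed to ec.weka.Evolve.run exactly as in the two snippets above.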

Conclusion

Now you should be able to construct a simple Clusterer for weka using an ECJ algorithm. With some practice on these two pieces of software, you should then be able to build more complicated things. The class weka.clusterers.EcjBasedClusterer shows an example of a clusterer taking two parameters, one string for an ECJ parameters file and another string for an ECJ checkpoint file. The latter overrides the former if both are given. The clusterer does nothing in its buildClusterer method and always returns 0 for the cluster. It is useless as a clusterer; the goal is just to show the overall construction of a clusterer with some options.

This class is available at: src/weka/clusterers/EcjBasedClusterer.java.

Be aware that even if this example only shows how to build a clusterer, the work needed to build a classifier or an associator is much the same. You just need to change the base class used in weka and construct an algorithm for classification instead of clustering.

In conclusion, the capabilities of this interaction between ECJ and weka only depend on the algorithm you're writing with ECJ and on the use you make of the results in weka.

Written by Romaric Pighetti in 2012/01.