During the supervised training phase, an ANN builds an approximate function that maps a list of inputs to a list of desired outputs.
In the search phase of a genetic programming algorithm, a program (take the example of an unknown mathematical function that must be approximated using a suitable combination of sin, cos, polynomial and exp functions and the x, +, -, * primitives) undergoes multiple transformations so that it maps the given inputs to the desired outputs as closely as possible.
The two approaches seem to have the same functionality. However, I think that genetic programming is likely to be more "powerful" than ANNs, since it can dynamically build very complex programs (functions) that maximize some utility function (i.e. minimize the fitting error), whereas an ANN only optimizes a predefined set of coefficients (attached to a fixed set of neurons) to learn the input/output mapping.
Depends completely on the problem. Look for instance at the publications on Alternating Projection Neural Nets (Robert Marks and Seho Oh). Many of the implementations fell midway between the systems you describe in your note. The key is for which set of problems the system will converge to a useful result:
- what is the quality and completeness of your measures and training data?
- what sort of singularities exist in your training data?
- what are your restrictions on topology, both of your ANN and your "genome"?
- what error measures are ultimately important?
Whichever approach works for your problem - that one is the "more powerful".
GP and ANN are not mutually exclusive. The ‘most powerful’ approach would be a combination of the two: use genetic programming to optimize the 'structure' of your ANN. There is a considerable literature on the topic. We have used GP to improve the connectivity of a spiking neural network:
Ju et al (2012) Effects of synaptic connectivity on liquid state machine performance, Neural Networks 38: 39-51.
I think that what you are trying to say is that both can perform nonlinear regression.
It is known that a neural network with a single hidden layer made of a finite number of neurons and a linear output neuron is a universal approximator: it can approximate any sufficiently regular nonlinear function with arbitrary accuracy in a bounded domain of inputs. Consequently, a neural network of finite size can approximate whatever function you find with genetic programming. So why would one use genetic programming at all?
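As an illustration, here is a minimal sketch (Python with numpy assumed; the target function, interval, and number of hidden units are arbitrary choices) of a single hidden layer of random tanh features with a linear output neuron fitted by least squares:

import numpy as np

# Minimal sketch: one hidden layer of tanh units with random, fixed input
# weights, plus a linear output neuron fitted by least squares, approximating
# sin(x) on a bounded interval.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 200).reshape(-1, 1)   # bounded input domain
y = np.sin(x).ravel()                            # target function to approximate

n_hidden = 50
W = rng.normal(scale=2.0, size=(1, n_hidden))    # random input-to-hidden weights
b = rng.normal(scale=2.0, size=n_hidden)         # random hidden biases
H = np.tanh(x @ W + b)                           # hidden-layer activations

w_out, *_ = np.linalg.lstsq(H, y, rcond=None)    # fit the linear output layer
print("max abs error:", np.max(np.abs(H @ w_out - y)))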
But there is much more to the question than uniform approximation. The purpose is not just to find a model that fits the training data, but, much more importantly, a model that generalizes efficiently. Therefore, the real question is: how does the complexity (VC dimension) of a function found by genetic programming compare to the complexity of a neural network, when trained on the same data with the same final training error? I do not know the answer, but some results may be available in the literature; if not, it would be a challenging research topic.
Furthermore, there are many additional issues such as training time, computational complexity, number of hyperparameters (i.e. user-defined quantities that are not obtained by training), scaling of the complexity with the number of variables, etc.
Both have similarities in what they can do, but depending on the problem sometimes ANNs will fit fine and sometimes GP will; i.e., ANNs are usually straightforward to implement and work pretty well, but their black-box nature makes them non-user-friendly. On the other hand, GP results are often human-friendly, but coding such an algorithm from scratch can be painstaking. Notwithstanding, one has to take a look at the NFLT, which states that two algorithms are equivalent when their performance is averaged across all possible problems. For more detailed material check Wolpert, D.H., Macready, W.G. (1997), "No Free Lunch Theorems for Optimization."
I think you mean that both can perform nonlinear regression, which is correct.
It is known that a neural network with a single hidden layer made of a finite number of neurons is a universal approximator: it can approximate any sufficiently regular function with any prescribed accuracy in a bounded region of input space. As a result, any function found by genetic programming can be approximated to any accuracy by a neural network of finite size. Therefore, there is no point in using genetic programming.
However, there is more to machine learning than just uniform function approximation. You do not want a model that just fits the training data; what you are really interested in is a model that can generalize efficiently. Therefore, the key issue, for answering your question, is complexity (in the VC-dimension sense): how does the complexity of a model obtained by genetic programming compare to the complexity of a neural network that was trained on the same data and has the same final training error? There may be answers to that question in the literature; otherwise, it could be a challenging research problem.
Additional issues should be considered: training time, computational complexity, number of hyperparameters (i.e. parameters that must be defined and twiddled by the model designer), scaling of the complexity with the number of variables, etc.
I think GP is better because it is interpretable, whereas an ANN is a black box that just gives you an answer.
Genetic and Evolutionary algorithms explore a model space in a way that improves heuristically over random search on the basis of various implicit assumptions about good solutions being near other good solutions, including better solutions, in some sense. The family of functions you listed represent a further explicit assumption that the target function can be well approximated with these. This is reasonable because you have many bases nestling amongst them which allow you to reconstruct (smooth) functions with arbitrary accuracy. Of course, you can always approximate such a function as a sequence of linear (straight line) segments, with arbitrary accuracy, but discontinuous derivatives (slope changes incrementally between segments). Using splines allows you to ensure a continuous derivative, but any fixed order polynomial approximator will still in general result in discontinuity of higher order derivatives.
Neural networks typically use a basically linear weight equation for approximation, viz. dotproducts and Perceptrons, with MLP/Backprop using sigmoids to smooth the derivative. RBFs use distance from the weight vector for approximating a circular (hyperelliptical) region. These may be combined with other functions, with various polynomial and exponential forms being common, and wrapping these in a kernel being common in neural networks (and in particular SVMs) that use the kernel trick. Different layers may use different functions of inputs and weights, or different kernels, but normally they aren't combined together as flexibly as GP can achieve.
On the other hand, intelligent use of ANNs may provide various functions of the original attributes as additional inputs, e.g. if an inverse square law or an entropic law is theorised. More generally, the space of structural and functional variants of an ANN may be explored conveniently with evolutionary techniques, including genetic models, swarm models, etc. In this case, where ANNs and EAs are combined, there are variants depending on whether the weights get learned by standard procedures (e.g. Backprop) at each generation, or whether they get learned by evolutionary techniques.
The advantage of such hybrid techniques is that they can deal with cases that are difficult for one alone as a universal approximator, and in particular that they can potentially find simpler solutions than either alone, viz. a more parsimonious model.
@Monther and Mohammad - the results are equally interpretable, as both GPs and ANNs are just functions made up from a set of basic functions and constants/weights. The real issue is simplicity, to find the shortest formulation, so that a function that is trigonometric or cyclic, for example, is defined using trig functions rather than built out of linear functions, which would have the system working mainly to approximate those same trig functions as a sequence of straight-line segments.
But for any finite set of atomic functions, there will be other functions that are not simply approximated by any subset of them. This is basically a reformulation of the No Free Lunch theorem (NFL).
But with all due respect, I disagree with you @David, because an ANN cannot explain how it gets its result, and I am talking about a pure ANN, not hybrid methods. If you use only an ANN, you have no explanation for it, but with fuzzy logic, expert systems and GP you do have an explanation.
There is no single technique that outperforms all others over the full range of problems, so nobody can answer this question precisely without more details about the problem and the available data.
Both are powerful. However, the decision on which of them should be used depends on the problem at hand. From the overfitting perspective, I'll go for the better generalising ANN.
I am probably iterating what others have previously said, but this is very much dependent on the problem at hand and your specific goals. That being said, GP is known to suffer from the problem of bloating, e.g. construction of overly complex models for minimal predictive ability gain (from a point and on). On the other hand, GPs do have the advantage of interpretability (provided that you are careful enough and avoid bloating). ANNs are generally thought of as more robust (and this is my experience too). I would experiment with both. You may even combine elements of them, e.g. use a GA for feature selection and an ANN for the predictive model construction (there's ample literature on that).
Mohammad writes... "I disagree with you @David, because an ANN cannot explain how it gets its result, and I am talking about a pure ANN, not hybrid methods. If you use only an ANN, you have no explanation for it, but with fuzzy logic, expert systems and GP you do have an explanation."
Agreed that the explanation capacity is a bit more than a pure ANN, but surely this applies just as much to the others, and the trickiest case to explain is the genetic (or evolutionary) "decision", or indeed any metalearning result of higher complexity such as decision forests or boosted learners. And the question that you raised was how interpretable, not whether there was a built in explanation mechanism. The ability to be interpreted derives simply from having a well defined representation.
So all of these systems in their pure form just provide answers, and some work is required to understand the answers. In the original question the focus is on ANN vs GP regression, and we will in general get different functions (e.g. weighted sums) of functions for the different algorithms, although they could sometimes be the same, and by default there is no explanation, but there is always the possibility of interpretation by looking at the representation (opening the black box - all are black boxes except the Expert System, which is by definition handcrafted and completely visible to the Knowledge Engineer).
More generally, for classification problems, ANN/SVM and Learned or Knowledge Engineered Systems (Decision Trees, or Expert Systems) produce graph-like representations that do indeed require some work to understand, or to massage into a reason for the decision; doing this is very similar whether for a learned network, a learned tree, or a hand-crafted network, tree or forest (all just functions on graphs with weights). The genetic (or evolutionary) approaches tend to be hybrid by nature, as you need a representation and atomic building blocks to work with, so it boils down to the explanation mechanisms appropriate to that representation, which is often going to end up being representable as a graph too (it might very well be an ANN or ES/DT/DF). Usually metalearners, including GPs used as metalearners, will be blacker boxes than simple neural network or decision tree representations, and if targeted to produce simple graphs/trees as base learners, GPs will produce similar representations (if you constrain them appropriately so they can't get more complex).
Mohammad wrote... "I disagree with you @David, because an ANN cannot explain how it gets its result, and I am talking about a pure ANN, not hybrid methods. If you use only an ANN, you have no explanation for it, but with fuzzy logic, expert systems and GP you do have an explanation."
As David said, expert systems don't belong here. Nor do hand engineered fuzzy logic systems since neither are learned functions.
What is left are ANN's and GP. In practice, GP's are just as difficult to interpret because the resulting programs look like they were written by Perl programmers on crack. With ANN's, at least, you have some idea of the gradient of the error function which can tell you which variables seem to be important in various contexts.
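To make the gradient point concrete, here is a minimal sketch (Python with numpy assumed; the tiny two-input network and its weights are made up for the example) that computes the gradient of a one-hidden-layer network's output with respect to its inputs, as a rough indication of which variables matter at a given point:

import numpy as np

# Toy one-hidden-layer network with made-up weights; we compute d(output)/d(input)
# analytically to see which input variable the output is most sensitive to at a
# given point.
W1 = np.array([[1.5, -0.7],
               [0.1,  2.0]])          # input-to-hidden weights (2 inputs, 2 hidden units)
b1 = np.array([0.0, -0.5])            # hidden biases
w2 = np.array([1.0, -1.2])            # hidden-to-output weights
x  = np.array([0.3, 0.8])             # point at which sensitivity is inspected

h = np.tanh(x @ W1 + b1)              # hidden activations
y = h @ w2                            # network output

# Chain rule: dy/dx = W1 @ (w2 * (1 - h**2)), since d tanh(z)/dz = 1 - tanh(z)^2
grad = W1 @ (w2 * (1.0 - h**2))
print("output:", y, "  input sensitivities:", grad)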
None of this really matters in industrial and commercial systems because such systems typically train the highest quality classifier possible, regardless of explanatory power and then train a separate reason code classifier. The reason code engine can have access to the internals of the main classifier if desired, but the reason codes really don't have to have all that much to do with the main classifier so much as with regulatory processes.
As Patrick indicates, the combination is probably better than either individually. This is the idea behind Ensemble learning (http://en.wikipedia.org/wiki/Ensemble_learning) and there is significant evidence to support it (e.g., the DREAM project challenge organizers emphasize this idea). For example, many decision trees are better than one (random forests) and multiple recommendations are better than one (kNN).
However, ensemble methods tend to be difficult to interpret. A single tree is intuitive and immediate, but a random forest??? That said, an ANN is essentially an ensemble method -- each "neuron" is a classifier. Moreover, as "deep learning" shows, the more sophisticated the network topology, the more powerful the ANN. So in principle, an ANN's explanatory power can be extended indefinitely via ever more complex underlying network topologies, but only the ANN's with the simplest topologies can be readily interpreted.
In most cases, an ensemble method is useful. But if you use it to combine a method with good performance and another with bad performance, the final outcome will most likely not be satisfactory.
In relation to metalearners, including GP, ensembles, stacking, fusion, boosting, bagging, random forests, rotation forests, etc., it is not just how good the individual methods (the base learners) are, but how diverse they are. In fact, boosting, bagging and evolutionary techniques specifically encourage the use of weak learners as components, and have mechanisms to encourage diversity (i.e. using randomness, working harder on cases that others get wrong, searching harder in areas that haven't been explored yet, or removing attributes that are strongly relied on in earlier learners).
In terms of measuring diversity, appropriate techniques include correlation or kappa (which compare in a chance-correct way without a particular direction) or in a somewhat different model, informedness (if you start with one preferred/best method and are looking for others that are complementary to it). For metalearners that use voting or linear combination, the system will indeed resemble something that could be learned with additional layers of a neural network, etc.
In fact, with an ensemble of classifiers, you can ignore the answers they all get right, the easy cases, often 80% or so according to the 80:20 rule... What is important is that they have different wrong answers (particularly for multi class problems or regression problems) and are not a committee of yes-men! And if the problem is unbalanced (e.g. the diseases of interest are relatively rare compared to the number of healthy patients) it is important you don't just consider how many healthy patients you identify correctly (you get high accuracy by chance in such cases, and low accuracy by guessing when there are many reasonable choices). This is why all the methods we use for maximizing diversity involve minimizing a chance-corrected measure such as correlation or informedness. It is also important to compare the actual predictions, not just throw that info away and compare the right/wrong statistics.
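As a concrete illustration of a chance-corrected diversity measure (a sketch only; the two prediction vectors below are made up), here is Cohen's kappa between the outputs of two ensemble members - values near 1 mean the members are redundant, values near 0 or below mean they are usefully diverse:

import numpy as np

def cohens_kappa(a, b):
    # Chance-corrected agreement between two vectors of predicted labels.
    a, b = np.asarray(a), np.asarray(b)
    labels = np.unique(np.concatenate([a, b]))
    p_obs = np.mean(a == b)                                         # observed agreement
    p_exp = sum(np.mean(a == c) * np.mean(b == c) for c in labels)  # agreement expected by chance
    return (p_obs - p_exp) / (1.0 - p_exp)

# Predictions of two hypothetical ensemble members on the same test set.
clf1 = [0, 1, 1, 0, 2, 2, 1, 0, 2, 1]
clf2 = [0, 1, 0, 0, 2, 1, 1, 2, 2, 1]
print("kappa:", cohens_kappa(clf1, clf2))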
Hello,
I completely agree that GP is much more powerful than ANN: 1) the chance of falling into local optima is lower in GP than in ANN; 2) GP covers non-linear equations easily with no additional complexity, whereas in ANN (either feed-forward or back-propagation) it can be very computationally demanding to decide/calibrate the needed nodes and weights; 3) finally, its results are human-readable, which is not the case with the ANN black box.
Three points on this discussion:
1) No Free Lunch Theorem (to be found e.g. in Duda et al 2001) states there is no method that can be classified a priori to be superior to another one in a general sense. Its superiority can only be checked with a particular data set and applies only for this particular data set.
2) Soft-Computing (SC) aka Computational Intelligence (CI), which include both Genetic Programming and ANN, aims at the complementary usage of techniques included in its framework (and not to the competitive comparison).
3) In this context I found the paper by Furuhashi 2001 (Furuhashi, T., "Fusion of fuzzy/neuro/evolutionary computing for knowledge acquisition," Proceedings of the IEEE , vol.89, no.9, pp.1266-1274, Sep 2001
doi: 10.1109/5.949484) valuable. He defines the concept of a characteristic function for each methodology within SC/CI, which defines the role of each methodology in a complex system. Following this concept, the characteristic function of ANN would be the approximation of functions, while that of Evolutionary Computation (including Genetic Programming) would be optimization.
Dear Bekir Karlik: Genetic Programming does not mean Optimization using Genetic Algorithms...
According to wikipedia:
"
In artificial intelligence, genetic programming (GP) is an evolutionary algorithm-based methodology inspired by biological evolution to find computer programs that perform a user-defined task. Essentially GP is a set of instructions and a fitness function to measure how well a computer has performed a task. It is a specialization of genetic algorithms (GA) where each individual is a computer program. It is a machine learning technique used to optimize a population of computer programs according to a fitness landscape determined by a program's ability to perform a given computational task.
"
In my comparison of ANN and GP, I focused only on the ability of both approaches to build a mathematical function that efficiently fits some inputs to the desired outputs. For example, GP can suggest that the function F that ensures F(x_i) = Y_i for all i = 1..N is F(X) = X*COS(X) - 1/X - EXP(-X) + 3 ... with a given approximation error which must be minimized. In such a GP problem, the population is a set of candidate functions that compete to fit a predefined set of I/O pairs.
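To make that search phase concrete, here is a deliberately tiny sketch (pure Python; the function set, mutation-only variation, population size and the made-up target function are all illustrative choices, with no crossover for brevity) of a GP-style loop evolving expression trees to fit sampled (x_i, Y_i) pairs:

import math, random

# Tiny GP-style symbolic regression: evolve expression trees built from a small
# function/operator set so that F(x_i) approximates Y_i.
OPS   = {'+': lambda a, b: a + b, '-': lambda a, b: a - b, '*': lambda a, b: a * b}
FUNCS = {'sin': math.sin, 'cos': math.cos, 'exp': lambda a: math.exp(min(a, 20.0))}

def random_tree(depth=3):
    if depth <= 0 or random.random() < 0.3:                      # leaf: variable or constant
        return 'x' if random.random() < 0.5 else round(random.uniform(-3, 3), 2)
    if random.random() < 0.5:                                    # binary operator node
        return (random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1))
    return (random.choice(list(FUNCS)), random_tree(depth - 1))  # unary function node

def evaluate(tree, x):
    if tree == 'x':
        return x
    if isinstance(tree, float):
        return tree
    if tree[0] in OPS:
        return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))
    return FUNCS[tree[0]](evaluate(tree[1], x))

def error(tree, xs, ys):                                         # fitness = sum of squared errors
    try:
        return sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys))
    except (OverflowError, ValueError):                          # guard against numeric blow-ups
        return float('inf')

def mutate(tree, depth=3):                                       # replace random subtrees
    if random.random() < 0.3 or not isinstance(tree, tuple):
        return random_tree(depth)
    return (tree[0],) + tuple(mutate(t, depth - 1) for t in tree[1:])

xs = [0.1 * i for i in range(1, 40)]                             # sampled I/O pairs from an
ys = [x * math.cos(x) + 1.0 for x in xs]                         # unknown (made-up) function

pop = [random_tree() for _ in range(200)]
for gen in range(50):
    pop.sort(key=lambda t: error(t, xs, ys))
    survivors = pop[:50]                                         # truncation selection
    pop = survivors + [mutate(random.choice(survivors)) for _ in range(150)]
best = min(pop, key=lambda t: error(t, xs, ys))
print("best error:", error(best, xs, ys), "best tree:", best)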
3 points: two general and one more detailed to exemplify with a specific case study (and a remark on design diversity, also for ensemble methods).
I agree with the many who highlighted the essential role of the problem to solve and of a sound modelling strategy. This is almost a truism. It would be nice to have a general purpose optimal approach. If it were possible to find such a silver bullet, however, this problem would not be as scientifically interesting as it indeed is. This implies moving from a solution-driven analysis to a problem-driven one.
[Point 1.] Domain-specific knowledge may often help to shape the semantics which innervates appropriate mathematical formulations. E.g. see http://doi.org/cqp4jr (note that GP is not rarely considered a black-box method, without an easy explanation of the resulting mathematical equations!), http://doi.org/ff3z6v (notice the role of a priori knowledge for designing understandable GA; notice also the discussion on the No Free Lunch Theorem), http://ur1.ca/fwp0v , http://doi.org/b76k39 , http://doi.org/cr55f8 , http://doi.org/bpjz4s
The problem itself (if mastered by a domain expert) might suggest the more appropriate mathematical formulations (if mastered by the computational modeller). This could even dominate the outcomes, maybe almost irrespective of the choice between ANNs and GPs. Brute computational force might perhaps emulate a wise mathematical formulation in some circumstances. However, whenever possible, I would recommend to first start from what we (domain-expert + modeller) know of the problem, letting only the unknown to the computational analysis. This means avoiding black-box brute force with the trendy cool tool of the moment.
In this respect, the question about ANNs vs. GPs might need to be complemented with a comment on the solution process.
Semantically-driven heuristics and a-priori knowledge may not rarely help to reduce the complexity of the “black-box” modelling exercise (either ANNs or GPs).
[Point 2.] For example, pre-processing (and post-processing) the quantities might be a valuable modelling step. E.g. see http://doi.org/fqrgxv , http://doi.org/bqj4ks , http://doi.org/dhzscw , http://doi.org/btb7zp , http://doi.org/b96vnf , http://doi.org/dh9s3w
Sometime a “good” – maybe straightforward – data transformation model (D-TM) might greatly simplify a subset of nonlinearities, easing the subsequent architectural design of either ANNs or GPs. After all, this is what the successful "kernel trick" tries to do in another technology - the support vector machines. Concerning the technologies we are focusing on, transformations may apply to both the outputs/quantities to model and the inputs/covariates. This may be trivial as a normalization or a simple monotonic transformation. It might be a multivariate transformation of array-quantities (such as a change from the time-domain to the frequency-domain; a wavelet transformation; or simply a different coordinate system, from e.g. polar to Cartesian or vice versa: slope and aspect in a DEM vs. the x/y components of the elevation gradient, or speed/direction vs. the x/y/z components, ...). Even GP itself may help: http://doi.org/cpvc65 , http://doi.org/dh39gs . Singularities (e.g. division by zero, log of negative values, ...) in the original mathematical formulation might sometime be removed with an appropriate D-TM so as for the space of parameters to have less constraints. If it is possible without introducing ill-conditioning, it might be worthy to do it.
Here the point is to avoid as much as possible wasting resources in the learning/evolution process of the black-box regression tools, by letting the semantics of the problem drive the obvious D-TMs *before* the less obvious ones (which are then left to the ANNs or GPs). This also implies preventing nonsense outputs or intermediate results from being computed (such as negative values for nonnegative quantities, see e.g. an introduction to semantic array-based constraints, http://ur1.ca/fuufd ; sometimes simply preprocessing a positive quantity with a logarithmic transformation may be enough... instead of asking the ANN/GP to struggle to blindly find a configuration of parameters which preserves positiveness!).
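A small sketch of that last point (Python with numpy assumed; the data and the linear fit standing in for an ANN/GP model are placeholders): model log(y) instead of y, so that back-transformed predictions are positive by construction:

import numpy as np

# "Transform first, learn the rest": a strictly positive quantity y is modelled
# through z = log(y); whatever regressor is fitted on z (here a least-squares
# line standing in for an ANN or GP model) gives predictions exp(z_hat) that
# are positive by construction.
x = np.linspace(1.0, 10.0, 50)
y = 0.5 * np.exp(0.3 * x)             # made-up positive quantity

z = np.log(y)                         # semantic preprocessing step (the D-TM)
coef = np.polyfit(x, z, deg=1)        # any regression model could be used here
y_hat = np.exp(np.polyval(coef, x))   # back-transform: always positive

print("min predicted value:", y_hat.min())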
[Point 3.] As others already did, I would like to emphasize that ANNs and GPs are not mutually exclusive. In particular, evolutionary techniques (which are not exactly the same as GP) may be fruitful in the training phase of ANNs. I would recommend the wide available literature. For example, classical papers are http://doi.org/fh26wh or http://doi.org/chwz3d . General: http://doi.org/fgvqd9 .
[Details: point 3.bis] Finally, something derived by my personal practice (however, please remind the no-silver-bullet/no-free-lunch premise!). Among the many existing, an example on which I am able to offer some better comment is the SIEVE architecture. Disclaimer: I co-authored the idea. SIEVE is for Selective Improvement by Evolutionary Variance Extinction: http://ur1.ca/fuugl. It is a general evolutionary architecture which proved helpful in ANN training of complex quantities (initially applied in neuro-dynamic programming).
The idea at the basis of the SIEVE architecture is as usual to avoid getting stuck in training local minima. It is an array-based architecture (multiple parallel optimization runs), designed to repeatedly use a custom optimization strategy of choice (e.g. descent methods) ignoring specific details of it. So one can reuse SIEVE with several different custom strategies. Basically, this is a sort of evolutionary ensemble method in which the array of models to ensemble is drastically reduced after ensuring the most promising ones to evolve exploring their expressiveness. What survives might even be just the best performing model, or a small set of winners to ensemble.
(In retrospect, I would systematically suggest to always preserve the diversity of efficient parametrisations in a broad family of models - which SIEVE allows to explore - and to ensemble them instead of selecting the single best performing model. The more diverse the winners, the better. SIEVE is able to increase diversity only as far as the designed family of models allows diversity to be expressed! Design diversity may also be recommended for mitigating software uncertainty, from which complex ANNs and GPs are of course NOT exempted http://ur1.ca/fx31x ).
The architecture tries to exploit the best performance that regressors (say ANNs) with a few parameters may offer, when (almost-)optimally trained. This means reducing the risk of overfitting, while the appropriate number of parameters might be increased e.g. with an early stopping strategy: first validate well-trained ANNs with few neurons, then explore possible better performances with more neurons; stop when *validation* errors increase.
The SIEVE procedure asks to explore the space of parameters of a regression family (e.g. the weights of an ANN, or maybe even the activation function of each neuron - sigmoidal, Gaussian, etc.) with multiple arrays of parameters (e.g. by generating an initial random set of them). Then, a fraction of the best performing parametrizations (assessed after a very short training phase of each array) is selected ("sieved") and "similar" arrays of parameters are then generated. The key idea is to iterate the training-selection-generation (T,S,G) cycle by devoting the most expensive training to the most promising arrays.
A couple of sentences from http://ur1.ca/fuugl might perhaps be of some use:
“The core of the SIEVE architecture is to use iteratively a selection (sieving upon an inverse geometrical series) of the best parameter vectors, reducing exponentially the number of parameter vectors surviving at the next iteration. We compensate this reduction with a geometrical extension of the computational resources dedicated to train each parameter vector (making so that each iteration uses the same amount of computations as the others), until the absolutely best vector passes the last sieve.”
“After each sieving selection phase, the survived vectors are put before a generative phase in which some other vectors are generated from them by adding perturbing noise. The noise variance decreases increasing the iteration number, in order to preserve the best training result achieved from the last parameter vectors, therefore leaving the possibility to significantly perturb some vector (exploration of new areas of the parameter space). All the sieved parameter vectors and the new generated from them are then trained, and so on for each iteration.”
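Reading the description above, the T,S,G iteration might be sketched roughly as follows (Python; this is NOT the authors' code - short_train, validation_loss, the halving sieve and the noise schedule are placeholder choices on a toy objective):

import numpy as np

# Rough sketch of the train-select-generate (T, S, G) iteration described above.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])                 # toy "optimal" parameter vector

def short_train(theta, budget):
    for _ in range(budget):                         # a few descent steps on ||theta - target||^2
        theta = theta - 0.2 * (theta - target)
    return theta

def validation_loss(theta):
    return float(np.sum((theta - target) ** 2))

pop = [rng.normal(size=3) for _ in range(64)]       # initial random parameter vectors
budget, noise = 2, 1.0
while len(pop) > 1:
    pop = [short_train(p, budget) for p in pop]                  # T: train briefly
    pop.sort(key=validation_loss)
    survivors = pop[: max(1, len(pop) // 2)]                     # S: sieve the best half
    children = [p + rng.normal(scale=noise, size=p.shape)        # G: perturb some survivors
                for p in survivors[: len(survivors) // 2]]
    pop = survivors + children
    budget *= 2                                     # spend more training on fewer, better vectors
    noise *= 0.5                                    # shrink perturbations over iterations

print("best validation loss:", validation_loss(pop[0]))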
Why not the two together !!
GPNN was developed to improve upon the trial-and-error process of choosing an optimal architecture for a pure feed-forward back propagation neural network (NN). Optimization of NN architecture using genetic programming (GP) was first proposed by Koza and Rice. The goal of this approach is to use the evolutionary features of genetic programming to evolve the architecture of a NN.
You don't need GP to evolve NN structure. You just need a non-parametric ANN solver. Much easier than wedging GP into the mix.
@Ted Dunning,
The question is: Which is more powerful: Genetic Programming or Artificial Neural Networks? answer the question please !!
Dear Engr. Nabil Belgasmi, after all, both are tools. How can one judge and compare ANN and genetic algorithms without knowledge of the problem? The relative power of the two depends on the nature and complexity of the problems.
Dear Engr. Nabil Belgasmi, Here is an article entitled ' Using Genetic Programming for Artificial Neural Network Development and Simplification'. you may access this via the following link.
http://www.wseas.us/e-library/conferences/2006venice/papers/539-232.pdf
@Farid,
The answer is that it entirely depends on the application and the context. In general GP has a much harder time with difficult optimization landscapes due to the complex nature of the relationship between program form and results. There is no significant difference in the set of functions that can be expressed, but it is often true that ANN's have a simpler way to express important prior information about what kinds of functions are plausible.
The purported advantage of GP formulations - that they produce inspectable forms - is largely illusory, since noise in the input causes the resulting GP programs to be very hard to understand.
It might be worthy to underline how broadly the topic is discussed (e.g. http://ur1.ca/fyjno ) aside from peer-reviewed works (e.g. http://doi.org/bfqc8b http://doi.org/d843wg ). However, I hope this should not be an excuse to avoid corroborating (http://ur1.ca/fz91b ) scientific statements with examples or literature references (in alternative, a hilarious section might be opened somewhere with proofs by obviousness, intimidation, lost reference, … see e.g. http://ur1.ca/fz8xg ). A general remark is that case studies and suggested applications seem to heavily depend upon the modelling architectures which the authors are aware of (I hope this collective discussion/review exercise might offer a plurality of perspectives). For example, in http://ur1.ca/fyjno several answers recall a frequent approximate classification of domains in which GPs or ANNs are supposed to perform well
GP -> optimisation/search problems (??)
ANN -> pattern recognition (??)
Of course, many architectures exploited GPs and ANNs in those domains. However, hard optimisation problems such as applications of stochastic dynamic programming (SDP) may be approximated with neuro-dynamic programming (NDP, based on ANNs) even where exact SDP fails due to the explosion of the required computational time (e.g. http://doi.org/pp3 ). On the other hand, pattern recognition and similarity problems might be addressed with GP (e.g. http://doi.org/dptdkp ).
As I recalled in my previous answer http://ur1.ca/fyjmw , successful applications might strongly depend on how well the semantics of the problem is described (maybe with pre/post-processing). This “modelling best practice” may be underlined irrespective of whether the preferred approach is ANN, GP or a hybrid/ensemble approach (http://doi.org/d843wg ).
I believe that training neural networks using genetic programming, in addition to optimal learning, allows neural network structures to take the place of the trees and generate new neural structures. It's cool.
Which of GP or ANN is appropriate is indeed highly dependent on the problem, and highly related to choice of representation. Choosing the most appropriate representation or model is often the most important step to understanding and solving the problem, and may determine whether a GP or ANN solution is more straightforward, and often a hybrid technique can be used - not just hybrid GP+ANN, but multiple levels of any kind of learning, even the same kind of learning (cf. Deep Learning).
In terms of models and heuristics, sometimes optimising the model does not solve the problem, and proving something about the model need not say much about the problem. It is important to understand the assumptions that are being made by the model (which may not be valid for the problem), as well as the costs and biases associated with particular learning techniques and heuristics (which may make a good solution to the problem impractical or impossible to find). Always be wary of claims to have optimised a problem when in fact one has heuristically reduced the problem to a poorly fitting model and optimised that.
I have found their relative performance and applicability to be very much problem and data dependent.
I agree with comments [@Ted] that although GP is often cited as less black-box that isn't necessarily the case - it does require effort (I've found I've had to strongly limit the size of the GP models to ensure readability, although avoiding 'bloating' can also help guard against over-fitting).
@Alejandro - Many years ago we did experiment with quite reasonable effect in combining neural nets via GP for molecular bioactivity predictions. GP-derived NN ensembles could out-perform any of our best individual networks on that problem:
https://www.researchgate.net/publication/221009463_Combining_Decision_Trees_and_Neural_Networks_for_Drug_Discovery?ev=prf_pub
https://www.researchgate.net/publication/228836821_Genetic_programming_for_combining_neural_networks_for_drug_discovery?ev=prf_pub
...we didn't try combining GP and NN learning aspects into a single algorithm but that sounds powerful if over-fitting can be controlled.
I agree [@David] that choice of problem/data representation is critical, and (in my humble experience) that crucially guides the choice of learning paradigm (with its consequent parameterisation requirements and model selection constraints).
Outside the hands of experts, standard ANN learning might not often generalise as well as Support Vector Machine learning (where concepts from Vapniks' statistical learning theory are used to constrain models). Kernels define performance and SVM utility for particular problems (as transfer function does in ANN), and I believe there has been some work on using GP for problem-specific kernel customisation.
Adding Gaussian noise to training data may offset potential overfitting in some cases when combining GP and NN. Test for generalization, e.g., compare testing on the straight set with testing on a 'noisy' test set.
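A minimal sketch of that suggestion (Python with numpy assumed; the toy data, the noise level and the linear stand-in for the learner are arbitrary): augment the training inputs with Gaussian-jittered copies and compare errors on a clean and a noisy test set:

import numpy as np

# Augment the training inputs with Gaussian-jittered copies (targets unchanged),
# then compare a simple fitted model on a clean and a noisy test set.
rng = np.random.default_rng(1)
X_train = rng.uniform(-1, 1, size=(100, 4))        # toy training inputs
y_train = X_train.sum(axis=1)                      # toy targets

sigma = 0.05                                       # noise level: tune per problem
X_aug = np.vstack([X_train, X_train + rng.normal(scale=sigma, size=X_train.shape)])
y_aug = np.concatenate([y_train, y_train])

X_test = rng.uniform(-1, 1, size=(30, 4))
X_test_noisy = X_test + rng.normal(scale=sigma, size=X_test.shape)

w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)  # linear stand-in for the GP/NN learner
y_true = X_test.sum(axis=1)
print("clean-test MSE:", np.mean((X_test @ w - y_true) ** 2))
print("noisy-test MSE:", np.mean((X_test_noisy @ w - y_true) ** 2))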
@Mikal
The idea of extending limited training data with alternative records derived from existing ones is a good one, and is not limited to simple addition of Gaussian noise. This is often exploited by presenting translations or rotations of images, or changes in contrast or brightness.
This can definitely help with certain kinds of over-fitting, but it is far from a panacea.
@Steven,
When you say "Outside the hands of experts, standard ANN learning might not often generalise as well as SVM's", I think that you may be confusing folks a bit.
The standard practice in ANN learning for at least 15 years has been to include strong forms of regularization. Many of these forms are directly comparable to the margin regularizers of SVM's, but many are not directly comparable.
No attempt at panacea, just a suggestion which may be helpful in some situations. There is no panacea. The problem and its amenity to suitable representations plus the ability to manipulate those representations are IMHO the determinants of which solution strategy is most suitable (or "powerful"). A strategy which is "most powerful" when used on one problem may be useless when applied to another. Like most things in the real world, which is better "depends" ... and depending on representation, etc. there are ways to make a solution strategy a better 'fit' for a specific problem or class of problems, e.g., the addition of Gaussian noise worked for me in my ANN probs (evolving architecture, topology, connectivity, algorithms, training set presentation [order, adaptive subsets, etc.], et al.). Essentially = an Evolution Program for evolving NNs.
Note re: data presentation ...
1. Iterative vs Batch presentation (=> update)
2. Order physical, reversed, or permuted
3. Include/Exclude acquired exemplars
4. Clear vs Noisy presentation
... varied all of these (and other factors, e.g., pruning, shortcut connections, adaptive training parameters, transfer functions, etc., etc.) probabilistically.
I have published a couple of conference articles comparing GP and ANN. In my humble opinion, ANNs' pros: fastest training phase. GPs' pros: less error for most regression problems (because the universe of possible GP models is bigger than that of possible ANN models); if you apply multiobjective GP you have a choice between a more complex formula with less error or a simpler formula with more error. ANNs' cons: you need expertise or a lot of work to determine a good topology for the network (number of layers, neurons per layer, transfer functions). GPs' cons: a lot of time for training (if you look for a very small error).
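To illustrate the error/complexity trade-off mentioned above (a sketch only; the candidates are reduced to made-up (complexity, error) pairs rather than real GP individuals), here is a plain Pareto-front filter:

# Keep only candidates for which no other candidate is simultaneously simpler
# (or equally simple) and at least as accurate.
candidates = [(3, 0.90), (5, 0.40), (6, 0.80), (7, 0.35), (9, 0.12), (12, 0.11), (15, 0.10)]

def pareto_front(points):
    front = []
    for c, e in points:
        dominated = any(c2 <= c and e2 <= e and (c2 < c or e2 < e) for c2, e2 in points)
        if not dominated:
            front.append((c, e))
    return sorted(front)

print(pareto_front(candidates))
# The user then picks a point on the front: a simpler formula with more error,
# or a more complex formula with less error.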
Rodriguez, NN topology and transfer functions can be determined via evolution.
Indeed it depends on the problem, data and representation. I have published some papers with both GA and ANN; each of them is suitable for certain kinds of problems.
... and an evolution system can fit ANNs to problems not "fittable" by regular training methods and/or rules of thumb. By including pruning and shortcut connections as mutation operators evolution can "shape" an ANN to model the relationships btwn input and output variables. See http://home.sprintmail.com/~kalki/Dissertation.htm
Evolutionary Neural Networks are more powerful than a separately used GA and ANN, and they are highly suitable for performing classification tasks on very complex data sets in the chemoinformatics field.
GAs are not part of my picture ... evolution programs (e.g., -similar- in motivation to Michalewicz) are a different story => generic. The binary encodings of GAs IMHO fall short of what's required/desired in evolving complex systems. OTOH, an evolution system which manages the application of the high-level operations of generations ... evaluation, selection, reproduction, mutation ... can require an interface for evolvable representations of any kind. One has only to 'fit' a data structure with operators that implement the lower level operators of crossover, mutation. For example, for NNs I provided Crossover and Mutation operators for structure and for values, thereby providing a unified Genetic/neural system with specialization ("handles") provided for the problem. ALL aspects of the NN can be evolved via an "EvolvableNetwork". This is not a GA and an ANN ... it's a single system. I'm working on evolvable decision trees now.
I dropped the 'classic' GA very quickly because I didn't like the translation steps ... which I felt were unnecessary. Why translate values back and forth (SLOW!!!) btwn binary and 'actual' values? It is better to manipulate the actual values ... which already have a binary :::representation::: in the machine. One has to be careful with real numbers (ugh...). Looking back, this was a logical step after devising cumbersome schemes for managing building blocks in the form of binary representations of values. Removal of the GA translation steps yields very welcome increases in speed.
So ... my view on a broad range of optimization problems is that given a proper representation, evolution -can- provide unexpectedly superior solutions. Sometimes it's not just what we use but also how we use it.
Re: evolution + NNs, we should remember separation of concerns, e.g., evolution system + neural system with appropriate implementation of the interface required by the evolution system and implementations which reliably reflect the desired effects of evolution's operators: reproduction/crossover, mutation, evaluation/fitness (operators for all native types provided by the e-system). Note: My humble efforts produce a human-readable record of architecture, topology/connections, weights, and parameters, e.g., something like this starting point for the XOR test ...
Cluster
{
Cluster
{
{ Build 2 Input }
}
Cluster
{
{ LogisticA Build 2 }
}
Cluster
{
{ LogisticA Build 1 }
}
}
Connect
{
Connect { 1 2 } { 3 5 }
Connect { 3 4 } { 5 }
}
... in the actual test the evolvable network discarded the 2nd layer and selected the Gaussian transfer function for output. For the Pilot Knowledge Engineering problem the starting point was ...
Cluster
{
Cluster
{
{ Input Build 9 }
}
Cluster
{
{ Sine Build 6 }
}
Cluster
{
{ LogisticR Min 0 Max 10 } // Ranged Logistic
}
}
Connect
{
Connect { 1 9 } { 10 15 }
Connect { 10 15 } { 16 }
}
See http://home.sprintmail.com/~kalki/Defense/Defense66.htm for an example of what the system produced.
@Keenan I knew you can evolve the topology of an ANN (never done it myself, though). When I said "expertise" I was referring to that same idea (you need to know both ANN and evolutionary approaches). Many researchers using ANN are not computer scientists; they only want to master one technique and use it in their chemistry/engineering/biology problems, so they do not have any interest in becoming AI or machine learning experts. And if you evolve the ANN then you also need more time for all the regression, which eliminates the great advantage of ANN (short processing time in the training phase). Anyway, I think in general the more precise you want the regression (approximation function), the more time you spend; this applies to pure GP or ANN+evolutionary approaches or any other hybrid technique (X method + evolution).
Rodriguez, I hear you. No argument. My 'noise' was just intended to provide some more background. Note: All this talk has made me want to pick up the old system again ... something which I do periodically as the C++ standard evolves.
Funny thing ... I was far from an expert in AI, etc. when I created my systems. The driving understandings in my work were biobehaviorally-based and systems-science-oriented. I contemplated things in the Natural Order of organisms, species, cultures, migrations, patterns of interaction (competition, cooperation), etc. . . . digressing ...
The topology part ... once we identify and implement rules/constraints which prevent the entry of conditions which might require repair we can turn it loose. I was pleasantly surprised to see how evolution managed to model Pilot KE, Postoperative Routing, NASA-O-Ring Failure Prediction, and Fisher's Iris Data. It successfully separated inputs and associated them with specific outputs ... with no a-priori awareness or knowledge re: meaning. In reality, the types of successes such results suggest are both intellectually pleasing and frightening. We have no way of predicting what our continually sophisticated systems will produce, i.e., emergent properties. My 'vision' at the time (~1993-6) included machines which can design (EP), build, and program (GP, GEP) themselves. The machines' capacity for evolving solutions does and will continue to outstrip the human pace.
As I think most of the contributors to this discussion have noted, the "most powerful" learning algorithms/methods are going to depend on (a) the problem you are trying to solve and (b) the dataset(s) you have available. In my opinion, ANNs and GAs are both very limited and cumbersome, and I tend to use SVMs for most of *my* learning tools; now trying to integrate the latest graph-theory based AI into the SVMs. Again, the "most powerful" is WHATEVER WORKS, which I think someone already said.
The success is generally problem specific... You will need several other bearings than an error measure to really evaluate which one is preferable.. NNs may require prudent selection of variables, and tend to memorise the training. In RapidMiner there is a meta ANN which explores the optimal configuration automatically, so quite handy... We see GP showing better generalisation on the unseen data. Also parameters are intuitive and transparency is better in terms of what the results are rather than NN's interconnected nodes...
Re: interconnected nodes ... this is where neural evolution stands beyond "handcrafted" NNs. When evolution is allowed to manipulate the topology, e.g., by pruning and shortcut connections the relationships btwn independent and dependent vars can become "visual" and immediate. It becomes readily apparent which outputs are a function of which inputs. My own humble experiments proved this in Pilot Knowledge Engineering (decision: fly/no-fly), Post-Op Patient Routing (decision: IC, Gnl Hosp, Discharge), NASA O-Ring Failure Prediction, and with Fisher's Iris Data. The separation of relationships was obvious. Parameters were adapted by the NNs and/or evolved (real value crossover, mutation). In every case the NNs were developed with generalization as a primary criteria of 'fitness' calculations. One of the 'drives' was sparseness, e.g., m processors and n possible connections versus x actual connections. This particular drive in fitness is a work to continue ... over the holidays maybe ;-) It began as a measure of determinism (under-determined ... over-determined). I left off with the thought that determinism was too aggressive ... perhaps OK if the evolution is phased for learning acquisition first, followed by a topology minimizing phase. Background motivation was hardware implementation, e.g., FPGAs.
There is never an absolute powerful technique for all the problems and subsequently the data sets!
I agree this depends on the problem and how you formulate the problem and link it to the approach. Moreover, different types of NNs and genetic algorithms will have different performance when applied to the same problem.
If we go back to the original motivation for ANNs, viz. simulation of the brain in analogue form, it would seem to me that ANNs should produce a more general-purpose solution. To carry the analogy further, we need a pre-processor to map the sensory inputs into an ANN. The question then is: where should we put the evolutionary functions?
Of course utilization of ANN vs. GA depends on the problem. On the other hand GA can be used for optimizing both the architecture and weights of ANNs. There are multiple publications on combination of ANNs and GA.
Dear Nabil Belgasmi, in general GA is better in comparison to ANN. However, the judgement on this can only be made after learning the problem. ANN is good for general problem solution and optimization.
An artificial neural network (ANN) model is made up of various simple and highly interconnected computational elements. In general, ANNs are of many types, but all of them have three things in common:
- individual neurons (processing elements),
- connections (topology), and
- a learning algorithm.
The processing element applies the neuron's transfer function to the summation of its weighted inputs.
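In code, the processing element described above amounts to a couple of lines (a sketch; tanh is just one common choice of transfer function):

import numpy as np

# The processing element: weighted sum of inputs passed through a transfer function.
def neuron(inputs, weights, bias, transfer=np.tanh):
    return transfer(np.dot(weights, inputs) + bias)

print(neuron(np.array([0.5, -1.0, 2.0]), np.array([0.2, 0.4, -0.1]), bias=0.1))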
As mentioned by many comments above, it always depends on the problem. I believe this is called the "no free lunch theorem" - all optimization algorithms will have the same performance if averaged across all optimization problems :)
Dear Nabil Belgasmi, The comparison of both depends on nature and complexity of the problems.
Comparing absolute run time for GA and ANN programs, ANN is superior because of fast NN algorithms. A large population makes GA more diversified than ANN. Choosing a small population size and repeating experiments many times reduces the run time and may bring it into the range of ANN. GA is also better in the sense that this technique is capable of predicting outcomes and provides better insight and better understanding.
The original question was GP vs. ANN. GP is not really the same as GA. GP works with trees of variable depth, i.e., during the "learning", the architecture is changing, whereas the ANN usually works with a fixed architecture. Moreover, a "node" in GP can have arbitrary functionality in principle, i.e., +, -, *, /, but also sin, cos, exp or whatever you wish. Thus, GP seems to be more powerful and flexible in principle. But when playing with GP, you immediately see how slow it is, when you use not only +, -, *. Already / may provide severe problems. Finding a simple GP approximation of a simple function, such as sin(3/x) is already extremely hard. Years ago, I wrote much more sophisticated algorithms based on GP, called it "Generalized GP (GGP)", but the GP community did not like this and the corresponding paper was rejected with silly arguments. In fact, GGP uses a fixed, user-defined architecture, includes some linear and non-linear optimization and many more. Finally, it drastically outperforms GP. More: http://alphard.ethz.ch/Hafner/ggp/gp.htm
Marcal ... evolution acts on the NN, therefore it is external to the NN proper. The NNs are manipulated "within" the environment that is the system which implements the evolution processes. One can "fit" a NN with operators with which the evolution system can effect the dynamics of the evolution process, including selection, reproduction, and mutation.
Example: I have a NN system which provides configurable features: learning algorithms, learning parameters, etc. (see home.sprintmail.com/~kalki/Dissertation.htm). All variable features of the NN are exposed to the evolution system by a subclassed EvolvableNetwork which is used within an "Entity" class which implements the interface required by the Evo system.
The Evo system provides the "environment" with configurable "pressures", rates of survival, reproduction, etc. The two aspects ... evolution and neural are standalone and independent. This problem-solving pattern can be applied to -any- problem for which a suitable manipulable representation can be made available.
Re: GP vs NN, nodes and functions ... my NN model uses sin, cos, symmetric sigmoid, asymmetric sigmoid, ranged sigmoid, ranged linear, and gaussian functions ... which the evolution system can select at will. Architecture is not fixed and can include shortcuts (across "layers") ... next iteration will add horizontal connections, e.g., within "layer".
Usually for a question like this, about approximating the value of sin(x) or similar, a genetic algorithm is favoured over an ANN, because with an ANN you need to fix constraints such as the weights and the error-ratio approximation each time; in a GA the approximation error decreases on some simple computational problems, but this does not help with NP-hard problems .....
Check out IEEE pubs re: evolutionary computing and NP-Hard probs, e.g., Dimopoulos: Recent developments in evolutionary computation for manufacturing optimization: problems, solutions, and comparisons.
Dear question followers, the comparison of the two depends on the nature and complexity of the problems. And a comparison always requires a benchmark.
Could we suggest any benchmark? If not, then the comparison is as below:
1. Comparing absolute run time for GA and ANN programs, ANN is superior because of fast NN algorithms.
2. A large population makes GA more diversified than ANN. Choosing a small population size and repeating experiments many times reduces the run time and may bring it into the range of ANN.
3. GA is also better in the sense that this technique is capable of predicting outcomes and provides better insight and better understanding.
Both could be used together to develop a hybrid model, e.g. HMM-GM.
Also it depends on the estimation window size and the time horizon, as genetic programming is usually used for short time periods. Moreover, remember that the better the training data, the better the forecasting accuracy will be.
Note: A couple of people have asked for e-copies of my project reports (e.g., dissertation). I do not have a ready e-copy of any of it. The dissertation is available from the Library of Congress (that's where I got Goldberg's dissertation :-) The dissertation defense (presentation, FWIW) is viewable at ...
home.sprintmail.com/~kalki/Dissertation.htm
Critical factors in the use of evolution include accurate and fully-modifiable representations of solutions, including problem-class-specific operators for recombination and mutation. These can be realized as functors for native data types and generic data structures (vectors, lists, etc.) => Generalized Evolution. That's my "thesis." For NNs and similar probs we can further implement operators which recombine and mutate structure or values, e.g., (pseudo) ...
EvolvableNN::Recombine (EvolvableNN&, const RecombinationParameters&)
// Probabilistically uses either or both structure and value methods
// determined by RecombinationParameters
EvolvableNN::RecombineStructure (EvolvableNN&, const RecombinationParameters&)
EvolvableNN::RecombineValues (EvolvableNN&, const RecombinationParameters&)
EvolvableNN::Mutate (const MutationParameters& mutation)
// Probabilistically uses either or both structure and value methods
// determined by MutationParameters
EvolvableNN::MutateStructure (const MutationParameters& mutation)
EvolvableNN::MutateValues (const MutationParameters& mutation)
Note: Value recombination and value mutation allow manipulation of all "independent variables" which can be on/off (e.g., self-adaptive learning parameters) or selected (e.g., transfer functions, weight-update rule, etc.).
@Mikal Keenan, sorry for my delay in answering your comments. I am afraid that I am not familiar with feed-forward ANNs, just the back-propagating type. I would appreciate any comments you would care to make concerning your evo methods vs. Deep Learning.
That depends on the problem as many stated here. If both are applicable to your problem you may take a look at the drawbacks or implementation constraints for each.
For example, ANN training may require you to normalize the weights of the ANN to prevent saturation, or to configure the hidden layers.
Try to investigate the several types of GP and ANN in depth.
ANN is more appropriate, especially for predicting the cutting-parameter variation of tool condition in metal cutting, the reason being its capability for a high prediction percentage and high-speed data processing.
Which is the most powerful depends completely on the nature of the task or problem, as each has problems that it is best suited for (though there are no specific classifications for such problems). Both GA and ANN yield weighted probabilities in the time convergence (evolution) of features within a dataset which the problem aims to estimate or predict at that point in time ... in addition to some of the positions stated here, such as those of Mikal Keenan and Afaq Ahmad.
Dear Professor Arnold Ojugo
Very nice opinions, thanks a lot for showing your interest for GA and ANN models.
With best regards.
Marcal,
Thanks for the hint re: Deep Learning. Sounds ++interesting. I am just now beginning to study it (deeplearning.net). Re: BP and Feedforward NNs ... same thing :-) Feedforward = architecture and execution. BP = the training algorithm.
Hi Nabil,
I would think the opposite. As others said, NNs are universal approximators, according to the Kolmogorov theorem.
When you think of GP, it uses a limited alphabet (so a missing piece will not let you construct some programs); also, the probability of finding a complex solution decreases with the complexity of the task (size of the program) - think of "detecting cats in youtube videos": how much code would it take to describe that?
On the other hand, GP is useful when you need to know (to verify, ...) the symbolic representation of the function, which is something a NN won't provide you (see the task of "symbolic knowledge mining from neural nets").
"Power" seems like a poor metric because it consists of too many considerations. For instance, ANN is guaranteed in infinite time to find the "best" approximation. But then how "best" that approximation is depends on how many nodes, etc. So we might need to think in terms of both infinite time and infinite resources. Discounting a solution at infinite time doesn't give much "power" for anyone but the theorist and none of us have infinite resources. A GA may find an excellent approximation quite quickly and may give additional insight into the "rules" of the problem space which might translate into have a lot of "power". So it depends on the criteria for defining "power".
Much depends on the problem space: its topology, the chosen representation, the objective function. Even if they are isomorphic, different topologies, representations, and objective functions can differ drastically in how fast an ANN trains on a set of exemplars. Then there is the architecture of the ANN itself.
With either approach, we will "leave things out". If we insist otherwise, we can just create a look-up table of all exemplars and apply some measure of closeness (and perhaps interpolation) to get the answer. That way we don't "lose" any information. (But of course, we also don't smooth out any error either.)
So we're left in a state where we just "throw" different architectures at a problem and see "what sticks". From experience we see where GA works better than ANN (or vice versa). But that doesn't mean that a clever representation of the problem won't result in a "more powerful" solution.
In terms of representation, both approaches are universal: feedforward ANNs are universal function approximators, and GP can be used with Turing complete languages. Nevertheless, how easily they can describe a particular function or program is domain dependent, and must also take into account the training algorithm's ability to find particular solutions.
Arguably an advantage of GP is that there is more flexibility to choose a symbolic representation that matches the domain. A well chosen function set may then make it easier to find solutions and also make it easier to interpret solutions. However, a sub-symbolic representation (such as an ANN) might be more appropriate when there is little prior information available.
It's also worth pointing out that the distinction between GP and ANNs is becoming fuzzier these days, with many people using evolutionary algorithms to train ANNs and others looking at GP representations that are closer to ANNs.
As both are used to solve a given problem, the question is: which method is more powerful for that given problem? So it depends totally on the problem at hand.
An ANN can be more powerful on a problem where a GA cannot be, and vice versa.
Power is not everything: the simplicity of the implementation, the response speed, and the cost are also factors influencing the choice!
From the no-free-lunch theorem, if an algorithm performs well on one set of problems, it will be degraded on others. So an ANN may perform well on one set of problems, and so may GA and GP.
Hyper-heuristics generate or choose heuristics to solve a particular problem. It would be interesting to see which one a hyper-heuristic would choose to solve a variety of classes of problems. Perhaps it could identify which problems an ANN solves best and which ones a GA would solve best.
There are probably many problems where a GA can be used to create quality NNs, with back-propagation then used to fine-tune them.
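As a rough illustration of that idea, here is a self-contained Python sketch (a toy of my own, assuming a single-hidden-layer regression net): a simple GA with truncation selection and Gaussian mutation searches for good weights, and plain gradient descent then fine-tunes the best individual.

import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 64).reshape(-1, 1)
y = np.sin(2 * X)                                   # toy target function
H = 8                                               # hidden units
DIM = H + H + H + 1                                 # W1(H), b1(H), W2(H), b2(1)

def unpack(w):
    return w[:H], w[H:2*H], w[2*H:3*H], w[3*H]

def predict(w, X):
    W1, b1, W2, b2 = unpack(w)
    Hid = np.tanh(X * W1 + b1)                      # (N, H)
    return Hid @ W2 + b2

def mse(w):
    return np.mean((predict(w, X) - y.ravel()) ** 2)

# --- GA phase: truncation selection + Gaussian mutation ---
pop = rng.standard_normal((50, DIM))
for gen in range(200):
    fitness = np.array([mse(w) for w in pop])
    parents = pop[np.argsort(fitness)[:10]]          # keep the 10 best
    children = parents[rng.integers(0, 10, 40)] + 0.1 * rng.standard_normal((40, DIM))
    pop = np.vstack([parents, children])

best = pop[np.argmin([mse(w) for w in pop])].copy()

# --- back-propagation phase: fine-tune the GA's best weights ---
lr = 0.05
for step in range(2000):
    W1, b1, W2, b2 = unpack(best)
    Hid = np.tanh(X * W1 + b1)
    err = (Hid @ W2 + b2 - y.ravel()) / len(X)       # dL/dout for mean squared error (up to a factor of 2)
    gW2 = Hid.T @ err
    gb2 = err.sum()
    dHid = np.outer(err, W2) * (1 - Hid ** 2)        # chain rule through tanh
    gW1 = (dHid * X).sum(axis=0)
    gb1 = dHid.sum(axis=0)
    best -= lr * np.concatenate([gW1, gb1, gW2, [gb2]])

On toy problems like this the GA quickly finds a reasonable region of weight space and the gradient phase then sharpens the fit; whether that beats backprop from a random initialisation is, as others note, problem dependent.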
@Patricia,
The NFL theorem is theoretically interesting, but practically pretty void since none of us live in a world without strong priors. The simple fact that we can learn about the world suffices to show this (by the same NFL theorem, in fact).
So the question devolves to a much more difficult one of which *practical* problems succumb more readily to one technique or another. While this is difficult to answer for state of the art GP or BP algorithms, it is not hard to eliminate poor implementations.
Do note the confusion that others have exhibited between genetic algorithms (GAs, typically following Holland in adding cross-over to mutation) and genetic programming (applying GAs to program text in order to create new programs). The original poster asked about genetic programming.
The key point about many genetic programming implementations is that the search space
a) has essentially unbounded complexity
b) often introduces very complex discontinuities into the search
The first property requires pretty strong regularization. The second can make the search very difficult, since discontinuities inherently make it difficult to reason about a neighborhood from a small view of that neighborhood. Backpropagation, especially on deep nets, maintains sufficient expressive power for many problems and adds continuous optimization.
Speaking as a practitioner with some industrial experience and success, my strong view of standard practice for supervised learning is that strong teams follow roughly the following outline, possibly with slightly different preferred toolsets:
1) try some large subset of: L1-regularized logistic regression, random forests, gradient boosting machines, ReLUs with dropout (a minimal sketch follows this list)
2) if one of the options in (1) worked well and is deployable into production, do it and move on to the next problem
3) otherwise, get insights from the complex methods to help augment features for the simpler methods
4) rack your brain for new feature ideas, expend political capital to get those features, and goto (1)
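A minimal sketch of step (1), assuming scikit-learn as the toolkit (my assumption; no specific library is named above). The ReLU-with-dropout network would need a deep-learning framework, so it is omitted here.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# stand-in data; replace with your own feature matrix and labels
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

candidates = {
    "L1 logistic regression": LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    "random forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in candidates.items():
    # pick the simplest candidate that scores "well enough" and ship it (step 2)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")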
The current state of knowledge may or may not show a clear preference, but the future trend looks obvious, particularly for some network problems. What is more powerful: our intelligence (billions of neurons) or our genetics (chromosomes)? However, even with artificial intelligence, the use of optimization techniques requires programming.
Dear Engr. Nabil Belgasmi,
I consider Genetic Algorithm more powerful than Artificial Neural Network.
An ANN is directed towards learning, while a GA is directed towards finding a solution. So I think it depends on the problem at hand.
Since we can create an ANN-like mathematical expression using GP, one could test whether it is better to use only summation and a non-linear transfer function (a conventional ANN) or to use other functions such as sin, ...; however, it must be considered that a simpler model could lead to better generalization ability.
But if you want to compare BP with GA: as I found from empirical papers and my own research in this area, there is no significant difference between adapting a neural network's weights with a GA and with the backpropagation algorithm. Still, GA has some interesting features; for example, it can help design the ANN topology and adapt the connection weights simultaneously, and it can create simpler ANN topologies by means of a well-defined penalty function or multi-objective objective functions. Our research shows that this type of combination, Evolutionary Neural Networks, can return significantly higher accuracy.
May I add one more point, for which ANNs are much hated in the scientific community while GP is slowly gaining ground in that space: the BLACK-BOX nature of ANNs.
Though one gets a good match of the predicted vs. target values if the training is good enough, one can never figure out exactly HOW the output variable is related to the input variables in the case of an ANN.
This can be done away with -- to a great extent -- in GP. While an ANN produces functions like y = a_1 \tanh(x_1) + a_2 \tanh(x_2) + ..., GP often comes out with a very simple and convincing-looking formula like y = x_1 * x_2 + x_4 / x_7, or something like that!
Not only are the relations derived from GP very simple, but there is also an inherent feature-selection mechanism built into the GP algorithm (which helps avoid over-fitting, a curse of most machine learning algorithms). The main thing to tune in GP is the depth of the tree, i.e. the complexity of the formula. Shorter trees (yielding the simplest formulae) are often simple but approximate, while longer ones are harder to explain in physical terms and may over-fit. So, for the problem at hand, the most suitable tree depth should be found by trial and error.
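For concreteness, here is a sketch of that depth/complexity tuning using the third-party gplearn library (its availability and the exact parameter names are my assumptions, not something stated in this thread):

import numpy as np
from gplearn.genetic import SymbolicRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 4))
y = X[:, 0] * X[:, 1] + X[:, 2]                    # hidden "true" formula to recover

gp = SymbolicRegressor(
    population_size=1000,
    generations=20,
    function_set=("add", "sub", "mul", "div"),     # the GP alphabet
    init_depth=(2, 4),                             # shallower trees -> simpler, more interpretable formulae
    parsimony_coefficient=0.01,                    # penalise bloat, which also limits over-fitting
    random_state=0,
)
gp.fit(X, y)
print(gp._program)                                 # the evolved symbolic expression, e.g. add(mul(X0, X1), X2)

Varying init_depth (and the parsimony penalty) while checking validation error is one way to carry out the trial-and-error depth search described above.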
Partha,
One method that is commonly used to understand neural networks is to reverse engineer typical inputs for desired output. This technique works very well for image processing, for instance, and can be seen being used to good advantage in
http://research.google.com/archive/unsupervised_icml2012.html
As you point out, if a modeling technique produces a very simple model, it can be interpreted almost regardless of the technology. GP produces pretty formulae in such situations and neural networks produce very nice linear separators. Once you get beyond very simple models, however, it can be really hard to understand what is happening. When the input features are something concrete like image pixels, then you can make pictures of maximal excitation patterns, which is nice. When the input features are more abstract, this becomes really hard.
In that case, however, it isn't the fault of the modeling methodology so much as a problem with abstract features.
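For readers who have not seen this "reverse engineering" trick, here is a toy numpy sketch of the idea (my own illustration, not the method of the linked paper): gradient ascent on the input of a small, frozen network to find a pattern that maximally excites one output unit.

import numpy as np

rng = np.random.default_rng(0)
# a frozen, already-trained network is assumed; random weights stand in for it here
W1 = 0.5 * rng.standard_normal((16, 8))
W2 = 0.5 * rng.standard_normal((3, 16))

def forward(x):
    h = np.tanh(W1 @ x)
    return W2 @ h, h

x = 0.01 * rng.standard_normal(8)           # start from a near-blank input
target_unit = 0
for step in range(500):
    out, h = forward(x)
    # gradient of out[target_unit] with respect to x, via the chain rule
    grad = W1.T @ (W2[target_unit] * (1 - h ** 2))
    x += 0.05 * grad                         # ascend: make the target unit fire harder
print("input pattern that maximally excites unit 0:", np.round(x, 2))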
I very much agree with @Partha Dey and basically disagree with those who say that model interpretability is pretty much the same and in practice doesn't matter. From the beginning, the intention of Koza's design of GP (leaving aside the claim at the time that it would eventually be able to evolve full working code) was to simultaneously evolve the model structure and meet the quality objective function, which is a pretty bold aim and difficult to achieve in the canonical version. However, the latest research has produced big leaps in terms of complexity reduction, quality, and model generalization. In short, model structure is important in the context of the application, and this has been demonstrated in a lot of different fields where GP has been applied. But I say this with a pinch of salt: in a lot of applications, an ANN is just the right choice to use.
I will focus on just one important aspect of your question: "What is the output of either ANN or GP?" In some disciplines, such as medicine, it is highly recommended and desired to search for solutions that can be analyzed and interpreted. Although big progress has been made in interpreting ANN outputs and the internal structure of given ANNs, the method is still largely a black box whose output is hard to interpret and to predict.
On the other hand, GP generates evaluation structures using known functions. Therefore, understanding how GP's final optimized function works is more or less straightforward.
Hence, whenever you insist on a high level of understanding of the solver, the opposite of the ANN black box, you should go for GP.
In medicine, such a solver becomes highly appreciated, as each misclassification means potential harm or even loss of life.
So, even if you have a better ANN solution, you might be forced to at least check it against a GP solution, or even to go for a suboptimal GP solution, as the latter is easy to analyze from within.