I have a data set consisting of presence/absence records and I would like to perform SDM (species distribution modelling). I've already tried R, but I would like to try some black-box software as well.
One good method could be self-organizing maps (SOMs, originally introduced by Kohonen), if the data records are vectors. The dissimilarity measure between them can (with some restrictions) be chosen accordingly, so that it also represents expert knowledge (see SOMs for non-standard metrics).
MaxEnt (http://www.cs.princeton.edu/~schapire/maxent/) and openModeller (http://sourceforge.net/projects/openmodeller/files/) are both popular platforms for desktop and command-line use.
I am the program manager for a group developing a cloud-based SDM platform called SPACES that integrates a very large number of algorithms in a browser-based cloud-computing platform.
It just requires a free account at https://adapt.nd.edu; then join the group below.
https://adapt.nd.edu/groups/cem/wiki/Tools
We also offer a virtualized openModeller experience, allowing users to run many parallel experiments at once on our servers.
Finally, we are hoping to create a one-stop gathering space for CEM/SDM folks to discuss, troubleshoot, and develop new platforms and algorithms, so if you are interested please consider checking out the Discussion section of the working group (https://adapt.nd.edu/groups/cem/overview).
There are two different questions here: one is the software that you use, and the other is the method that you use. If you work in R, you can use it to run any model (black box or not). As you have presences and absences, I would recommend a classic GLM (logistic regression), because its results are easy to discuss and interpret. But you can also try Maxent (via the dismo package), or GAM, or BRT... It's up to you! :-)
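To make the GLM suggestion above concrete, here is a minimal, language-agnostic sketch of what fitting a logistic regression (a GLM with binomial family) to presence/absence records amounts to. The covariate values and 0/1 records below are invented for illustration, and plain gradient ascent stands in for the iteratively reweighted least squares a real GLM routine would use.

```python
import math

def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(presence) = 1 / (1 + exp(-(b0 + b1*x))) by maximising
    the Bernoulli log-likelihood with plain gradient ascent."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += (yi - p)          # gradient of log-likelihood w.r.t. b0
            g1 += (yi - p) * xi     # gradient w.r.t. b1
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical environmental covariate (e.g. temperature) and 0/1 records.
temp = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
pres = [0,   0,   0,   1,   0,   1,   1,   1]
b0, b1 = fit_logistic(temp, pres)
```

Since presences cluster at high covariate values, the fitted slope b1 comes out positive and the predicted probability of presence rises with the covariate; in R the same fit is a one-liner with `glm(pres ~ temp, family = binomial)`.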
I am going to echo what others have said. It is actually a two-part question you are asking, and it really depends on what your data look like. First you need to know what kind of modelling approach you would like to use (e.g. GLM, GAM, GLMM, BRT, Random Forests, etc.). I would say that if you feel confident in your 'absence' dataset (that they are TRUE absences), then any of these would be an appropriate method. Each has its strengths and weaknesses, and there have been quite a few papers (many by Elith) comparing these approaches. If you have presence-only data, MaxEnt has been shown to be fairly robust in some comparison studies, but pseudo-absences can be generated for the other techniques as well.
Once you have decided which modelling approach is most beneficial for your dataset and research question, then you can determine the best software package for your needs. R has a steep learning curve, but once you have the basics of the language, the packages you can use are about as easy to use as MaxEnt. Still, it is pretty important to understand the statistics that are going into any of these methods, black box or not, as there is a real potential to 'abuse' the software, which may result in inaccurate inferences.
Maxent could be a good option. Also I recommend you try the BIOMOD platform for ensemble forecasting of species distributions. It is implemented in R and allows simulating distributions with several techniques, testing models with a wide range of approaches, and projecting distributions into different environmental conditions and dispersal functions. Finally, I would suggest you take a look at the dismo package in R.
The last two answers from Rachel and Veronica basically contain what I was going to write down (cf. the modelling techniques and the software to use: dismo and BIOMOD are awesome packages that include all the modelling tools you would need in R for this kind of purpose).
However, I would add a comment about the modelling techniques you could use. Because you have presence/absence data, I would definitely not use Maxent, but GLMs (cf. logistic regressions), GAMs or BRTs instead. Indeed, using Maxent with presence/absence data is a bit of a shame: it would be like throwing a lot of your data (the absences) away. We actually got this criticism from referees for using Maxent with presence-absence data. So, you should know from the outset that using Maxent with presence-absence data will increase your chances of getting this kind of comment from referees ;)
Be aware that MaxEnt has numerous problems with assumptions being made for the modeller by the software if your absences are not true absences (i.e. you failed to sample areas where you did NOT expect to encounter your organism of study). Check out Royle et al. 2012 in Methods in Ecology and Evolution. However, if you have repeated visits in your sample, the best approach is occupancy analysis, where detection probabilities are accounted for and absences figure into said probabilities. See MacKenzie et al.'s 2006 book "Occupancy Estimation and Modeling". The software used for these methods varies, from R packages to WinBUGS and the like.
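The core of the occupancy-modelling idea above can be sketched in a few lines: with repeated visits per site, occupancy (psi) and detection probability (p) are estimated jointly, so an all-zero detection history is no longer forced to mean true absence. The detection histories below are invented for illustration, and a brute-force grid search stands in for the proper likelihood optimiser the dedicated packages use.

```python
import math
from itertools import product

def log_lik(histories, psi, p):
    """Single-season occupancy log-likelihood (MacKenzie-style)."""
    ll = 0.0
    for h in histories:
        k, d = len(h), sum(h)
        if d > 0:
            # At least one detection: the site is certainly occupied.
            ll += math.log(psi) + d * math.log(p) + (k - d) * math.log(1 - p)
        else:
            # All zeros: occupied-but-missed OR truly unoccupied.
            ll += math.log(psi * (1 - p) ** k + (1 - psi))
    return ll

# Hypothetical histories: 6 sites, 3 visits each (1 = detected).
histories = [(1, 0, 1), (0, 1, 0), (0, 0, 0), (1, 1, 0), (0, 0, 0), (1, 0, 0)]
grid = [i / 100 for i in range(1, 100)]
psi_hat, p_hat = max(product(grid, grid),
                     key=lambda t: log_lik(histories, *t))
```

Because detection is imperfect, the estimated occupancy psi_hat ends up above the naive proportion of sites with detections (4/6): the model attributes some all-zero histories to missed occupancy rather than true absence.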
An ensemble modelling approach is better than single-algorithm modelling, so it is better to use a platform that provides an option for ensemble modelling. There are several such packages in the R programming language, such as dismo, biomod2 and BiodiversityR. These packages can be used for robust ensemble modelling, particularly BiodiversityR, which includes 20 algorithms. Good luck!
The dismo R package makes several of the preferred methods (Maxent, BRT...) available, and its vignettes contain helpful code showing much of the necessary data preparation as well as the analysis, along with clear explanation:
dismo: Species distribution modeling
http://cran.r-project.org/web/packages/dismo/
For more background on species distribution modeling, Elith & Leathwick (2009) provides an excellent overview and extensive bibliography:
Species Distribution Models: Ecological Explanation and Prediction Across Space and Time
There's not really a whole heap more to add to all the discussion that has taken place here. Obviously everybody who has weighed into this debate so far has a fairly firm grip on the statistics, processes and packages involved. There's an important point to bear in mind here, though, and that is: what is the question you are asking? By building your analysis on a presence/absence data set you are inherently limited to correlative models (MaxEnt, GLMs, Random Forest), rather than mechanistic models (CLIMEX or NicheMapper). It's really worth coming to grips not just with the statistics going into these models, but with the assumptions underpinning them, and how much your data or hypothesis-testing violates those assumptions. After all, "all models are wrong, but some can be instructive." You want to be certain that your model is the least wrong and most instructive it can be. There are a number of good papers out there, e.g.:
Elith, J., Kearney, M. and Phillips, S. (2010). The art of modelling range-shifting species. Methods Ecol. Evol. 1, 330–342.
Dormann, C. F., Schymanski, S. J., Cabral, J., Chuine, I., Graham, C., Hartig, F., Kearney, M., Morin, X., Römermann, C., Schröder, B., et al. (2011). Correlation and process in species distribution models: bridging a dichotomy. J. Biogeogr. 39, 2119–2131.
Buckley, L. B., Urban, M. C., Angilletta, M. J., Crozier, L. G., Rissler, L. J. and Sears, M. W. (2010). Can mechanism inform species’ distribution models? Ecol. Lett. 13, 1041–1054.
That said, if correlative modelling is all that you've got (which is fair, given how data-hungry mechanistic models are), your next question is: how statistically skilled and experienced in modelling are you? I agree with the statements above that throwing out your absence data is a shame, but using it is fraught unless it is genuine absence data, not "absence of detection" data. After all, a model is only as good as the data on which it was constructed. Further, MaxEnt is an easy, straightforward entry-level package and it is capable of generating relatively accurate predictions, if given relatively accurate data and used with appropriate care to avoid over-fitting and other statistical glitches. As has been said, however, there are more involved approaches in other stats environments, like R, that will provide more involved outcomes.
I guess that my answer comes down to two things: understand your question and understand your algorithms, because in the end any of the modelling approaches suggested can be the most appropriate under the right circumstances. These are the problems I would be most cautious about when approaching any SDM or ENM question, not necessarily which package I could get hold of or use.
What do you want from your modelling? Do you want a map predicting the relative abundance or habitat suitability for your species, or do you want some explanation for the patterns of observations? The answer to this question can strongly affect the suitability of different classes of models: one that is good at achieving one of these is not necessarily the best for the other. Also, what do you want to achieve with your black-box modelling in relation to the modelling you've already done? Are you looking to compare the performance of the methods? If so, then I'd strongly urge you to develop a deep understanding of the meaning of the metrics that you use to evaluate the models. Far too many papers quote kappa, TSS or AUC inappropriately, I presume because the packages provide them as an automatic output and because the authors have not taken the time to understand the behaviour of these metrics. Depending on the origin and meaning of your presence and absence data, your study may be one of the few that can really take advantage of kappa and TSS.
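To see why kappa and TSS behave differently (and why they are so often quoted without understanding), it helps to compute them by hand from a 2x2 confusion matrix. The counts below are invented for illustration; note that kappa depends on prevalence while TSS does not, which is exactly the kind of behaviour worth understanding before quoting either.

```python
def tss(tp, fp, fn, tn):
    """True Skill Statistic = sensitivity + specificity - 1.
    Insensitive to prevalence."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens + spec - 1

def kappa(tp, fp, fn, tn):
    """Cohen's kappa: observed agreement corrected for chance agreement.
    Unlike TSS, it depends on prevalence."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n                                            # observed
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2   # by chance
    return (po - pe) / (1 - pe)

# Invented confusion matrix: 40 true presences, 45 true absences,
# 10 false presences, 5 false absences.
print(round(tss(40, 10, 5, 45), 2))    # -> 0.71
print(round(kappa(40, 10, 5, 45), 2))  # -> 0.7
```

For a perfect model both metrics equal 1, and for a model no better than chance both sit near 0; between those extremes they diverge as prevalence shifts, which is why comparing kappa values across studies with different prevalences is risky.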
I think that many of the comments above implicitly assume that your presence records represent established populations. Depending on the type of organism and its propensity to develop ephemeral populations, you may want to consider exactly how you are framing your model.
I would like to echo the general issues of understanding your question and the algorithms, and as a corollary, generally avoid the use of ensembles.
In addition to what has already been said, I think it's important to consider the level of the modeller and your background (especially regarding scripting). For an entry-level modeller, MaxEnt is good, but as already said, it's a waste to use MaxEnt when you have both presence and (true) absence data; still, it can give you a good start. In your question you asked about "software that is reliable", and I would say the reliability of the software depends on a number of other factors, as already explained. So it's not just a software issue, it's a whole basket!
While I agree with most of the comments, I would like to make a very important point: the software is secondary. The most important thing is the question to be answered. Secondly, presence-absence (site occupancy rate) is functionally related to abundance. If your question is to predict relative abundance from presence-absence data, you first need to establish the occupancy-abundance relationship, which can be done even in Excel if you have counts. If you want some explanation for patterns of observation, you could fit a generalized linear model (e.g. logistic regression) with covariates. That is fine if detection probability is close to 1 (i.e. the species is easy to count and false negatives are reasonably rare) and the data are not serially correlated (longitudinal). However, the problem with presence-absence data is that in many cases they are zero-inflated, i.e. some of the zeros are true and others are false (false negatives). If the data are longitudinal, a large number of true zeros is also possible when the species is absent in some part of the year and under some conditions (time or condition can then be used as predictors). In such cases GLMs and GAMs are not that powerful, because they cannot tell you the degree of zero-inflation or how much of the "absence" is true absence. They also cannot handle serial correlation in longitudinal data. Therefore, a model that can analyse the zero-inflation, account for serial correlation, and estimate the effects of covariates at the same time is more useful; non-linear mixed-effects models are preferred. I have a paper in Ecological Modelling that looks at all of these. Please follow this link: http://www.sciencedirect.com/science/article/pii/S0304380009001963
I did most of my analysis in SAS, and the code is available in my paper in Pedobiologia. Please follow this link: http://www.sciencedirect.com/science/article/pii/S0031405608000073
If you do not have access to these papers, I will send them to you personally.
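The zero-inflation diagnostic mentioned above can be illustrated with a simple check that works in any language (or even Excel): compare the observed share of zeros with the share a plain Poisson model would predict, which is exp(-mean). The count data below are invented for illustration.

```python
import math

# Hypothetical site counts with many zeros (some true, some false negatives).
counts = [0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 3, 0, 0, 4, 0, 2]

mean = sum(counts) / len(counts)
observed_zero = counts.count(0) / len(counts)
expected_zero = math.exp(-mean)   # P(count = 0) under Poisson(mean)
excess = observed_zero - expected_zero

# A clearly positive excess of zeros over the Poisson expectation is a
# warning sign: a GLM without a zero-inflation component will mis-fit,
# and part of the "absence" is likely non-detection.
```

Here the observed zero fraction (11/16 ≈ 0.69) is well above the Poisson expectation (exp(-0.75) ≈ 0.47), which is the pattern that motivates zero-inflated or mixed-effects models rather than a plain GLM/GAM.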
The most important suggestions have already been made: (i) think about your questions first, and whether your data are adequate to answer them; (ii) an ensemble of bad models still remains crap ;-)
This said, a new and potentially interesting method has just been accepted in MEE:
I am really pleasantly surprised by such feedback! Thank you very much to you all for taking the interest and the time to answer my question. I have a lot to read and to do now, so I will be back after processing the information. Once again: THANKS!
I think your problem deals with binary data analysis. SAS will provide a good solution: you can try the PROBIT or LOGISTIC procedure (via the PROC statement).
You can deal with your problem using the dismo package in R. Check cran.r-project.org/web/packages/dismo/vignettes/sdm.pdf: you will find a detailed account of how to handle presence-absence data, with step-by-step R code for processing such data as well as for the species distribution modelling itself. If you want easier ensemble modelling, you can use the BiodiversityR package.
Just a remark on ensemble modelling. You don't have to 'average all models': R packages such as BIOMOD allow you to run several techniques at once while specifying an evaluation threshold to decide which models are included in the ensemble model (which can be the average, the median, or other options as well). If you apply the default values of biomod2, for example, only models with a True Skill Statistic (TSS) > 0.7 are included in the ensemble predictions.
Biomod2 can be found here: https://r-forge.r-project.org/R/?group_id=302 , more information about this ensemble modelling technique here: http://www.will.chez-alice.fr/Software.html
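The threshold-based ensemble logic described above is straightforward to sketch: keep only the models whose evaluation score (here TSS) exceeds a cut-off, then combine their per-cell predictions with a chosen statistic. The model names, TSS scores, and three-cell "prediction maps" below are invented for illustration and do not come from biomod2 itself.

```python
from statistics import mean, median

# Hypothetical candidate models with an evaluation score (TSS) and a
# suitability prediction for three map cells each.
models = {
    "GLM": {"tss": 0.78, "pred": [0.20, 0.55, 0.90]},
    "GAM": {"tss": 0.81, "pred": [0.25, 0.60, 0.85]},
    "RF":  {"tss": 0.62, "pred": [0.10, 0.95, 0.99]},  # below the cut-off
}

def ensemble(models, cutoff=0.7, combine=mean):
    """Combine predictions cell-by-cell, but only across the models
    whose evaluation score clears the threshold."""
    kept = [m["pred"] for m in models.values() if m["tss"] > cutoff]
    return [combine(vals) for vals in zip(*kept)]

pred = ensemble(models)            # averages GLM and GAM only
pred_med = ensemble(models, combine=median)
```

With the 0.7 cut-off the RF model drops out, so the ensemble map reflects only the two better-scoring models; swapping `combine` between `mean` and `median` mirrors the "average, median, or other options" choice mentioned above.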
I agree with Diederik that with ensemble modelling you do not have to average (or take the median or consensus values). However, I would go further and say that there are relatively few cases where ensemble modelling makes sense. It makes sense for some weather forecasting models where the feedback is fairly immediate: you can run your models and check the next day whether they were correct or not. It also makes a lot of sense to run ensembles when you want to set up complicated model treatments to assess the sensitivity or uncertainty in the system. In that case you might set up a full- or partial-factorial set of treatments, and the only "statistic" being assessed is the range, which has no significant assumptions associated with it. It may also be useful to look at a set of ensemble model runs to pick out the unusual, so as to understand the factor that that particular model has picked up.
However, as soon as you start mixing models and applying statistics such as the mean, median, mode etc., I think we are usually stepping outside the bounds of good science. If you recall from Stats 101, these statistics carry a set of assumptions: that the observations have been drawn from the same distribution, obeying the law of central tendency. I can't imagine how one would argue that the various packages in Biomod generate a set of data spanning a statistical distribution. I think the packages represent a set of transformations of a single set of observation data, and I don't think there is any attracting force providing any degree of central tendency in the formulation of the algorithms, beyond a relatively small degree of convergent evolution in the methods. Certainly (with no disrespect to the authors of the ensemble modelling packages), the inclusion of algorithms in the packages is subjective, and probably reflects the ease with which they could be coded in R.
This subjectivity strongly affects the meaning of the results of the ensemble, and compounds the significant errors and biases typically found in distribution data. Now, recalling the "garbage in, garbage out" adage, the most important ability for correlative ecological modellers is to know the smell of garbage (The Economist). It seems to me that running an ensemble and mixing the results only serves to mask the smells. What happens if all or most of the models are badly biased? We know that correlative models are unreliable when applied to novel situations. We know from first principles that such models should be extrapolated with extreme care, and there are several papers demonstrating that they can get things badly wrong when applied, for example, across continents. If we run a suite of correlative models under the same extrapolation conditions and they tend to make the same mistakes, should we conclude that the results are consistently correct, or that the models are consistently wrong? If we do not take the trouble to test the results using independent data, we are naturally inclined to accept the incorrect conclusion. So, at the end of the ensemble exercise, are you any more knowledgeable about the results and the factors shaping them, or just more (over-)confident in potentially biased or incorrect results?
Enrique, if you treat your ecological modelling problem as simply a discrimination problem, as suggested by Nandana, you miss the fact that the data have complicated structural relationships with a set of environmental factors (think about the law of the minimum and the law of tolerance). This is one of those situations where the model that fits the data best (describing the sampled species distribution) may have little generality and no explanatory power.
For presence/absence data I have used MARXAN. It is a program used for reserve selection networks, and in my research group we have used it in some papers like the following:
Gap Analysis and selection of reserves for the threatened flora of eastern Andalusia, a hot spot in the eastern Mediterranean region
Antonio Mendoza-Fernández, Francisco Javier Pérez-García, José Miguel Medina-Cazorla, Fabián Martínez-Hernández, Juan Antonio Garrido-Becerra, Esteban Salmerón Sánchez, Juan Francisco Mota
There are some good books and papers you should consult that will provide you with valuable information in addition to all the comments above.
Araujo and Guisan. 2006. Five (or so) challenges for species distribution modeling. J. of Biogeography. 33:1677-1688.
Austin. 2007. Species distribution models and ecological theory: a critical assessment and some possible new approaches. Ecol. Modelling 200:1-9.
Boakes et al. 2010. Distorted views of biodiversity: spatial and temporal bias in species occurrence data. PLoS Biology 8:1-11.
Boria et al. 2014. Spatial filtering to reduce sampling bias can improve the performance of ecological niche models. Ecol. Modelling 275:73-77.
Drew et al. 2011. Predictive species habitat modelling in landscape ecology. Springer, New York.
Elith and Leathwick 2009. Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution, and Systematics 40:677-697.
Elith et al. 2006. Novel methods improve prediction of species' distributions from occurrence data. Ecography 29:129-151.
Franklin. 2009. Mapping species distributions spatial inference and prediction. Cambridge Univ. Press.
Franklin. 2013. Species distribution models in conservation biogeography: developments and challenges. Diversity and Distributions 19:1217-1223.
Graham et al. 2008. The influence of spatial errors in species occurrence data used in distribution models. J. of Applied Ecology 45:239-247.
Hegel et al. 2010. Current state of the art for statistical modeling of species distributions. In Cushman et al. (editors). Spatial Complexity, Informatics, and Wildlife Conservation. Springer.
Kramer et al. 2013. The importance of correcting for sampling bias in MaxEnt species distribution models. Diversity and Distributions 19:1366-1379.
Wisz et al. 2008. Effects of sample size on the performance of species distribution models. Diversity and Distributions 14:763-773.
There is a lot of valuable information in these papers. Good luck with your project.
Your question certainly generated a very interesting conversation! If you're interested in a 'black box' approach you could try openModeller (http://openmodeller.sourceforge.net/), which combines a number of modelling approaches in one framework. But don't forget to look into the issues that were mentioned by the many interesting contributions here.
I strongly agree with both Rachel and Bruce, and many others, I am sure. It depends on what your data look like, and it depends on the question you want to answer. The scary part of SDMs is that, to a manager, a really bad model can look just as nice as a good one: a colourful map is always very popular, and the probabilities and uncertainties are often difficult to communicate. I personally prefer GLMs and GAMs if you have a good sampling design with true absences, and applying these methods in R is quite user-friendly, if you work in R, that is.
When it comes to black boxes, Maxent is a commonly used program, and it performs well when compared to other methods. However, it is scary how often the Maxent program is used in "neck-down" analyses, which I believe is common: you put everything in there, keep all the default settings, just hit the "run" button, and it looks great. But you should know that Maxent creates all kinds of interactions between your input variables, without regard to the ecological relevance of those interactions, and without presenting them to you unless you look in the log file. It also creates pseudo-absences, 10,000 by default, and these are not true absences. So you need to be careful.
I think the question of what you want to answer, and what you want to use your model for, is very important. Do you want to understand (and visualize) general patterns? Do you want to develop the best possible map of your area? Do you want to transfer your model to a different area or a different time? Do you want to improve your sampling design? I would say that there are three different purposes: 1) ecological response modelling, in order to model relationships and find the general patterns in the overall ecological response; 2) spatial prediction modelling, in order to optimize the fit; and 3) projective distribution modelling, in order to transfer your prediction to a different spatiotemporal setting.