I have a data set consisting of presence/absence records and I would like to perform SDM (species distribution modelling). I've already tried R, but I would like to try some black-box software as well.
One good method could be self-organizing maps (SOMs, originally introduced by Kohonen), if the data records are vectors. The dissimilarity measure between them can (with some restrictions) be chosen accordingly, so that it also represents expert knowledge (see SOMs for non-standard metrics).
MaxEnt (http://www.cs.princeton.edu/~schapire/maxent/) and openModeller (http://sourceforge.net/projects/openmodeller/files/) are both popular platforms for desktop and command-line use.
I am the program manager for a group developing a cloud-based SDM platform called SPACES that integrates a very large number of algorithms in a browser-based cloud-computing platform.
It just requires a free account at https://adapt.nd.edu; then join the group below.
https://adapt.nd.edu/groups/cem/wiki/Tools
We also offer a virtualized openModeller experience, allowing users to run many parallel experiments at once on our servers.
Finally, we are hoping to create a one-stop gathering space for CEM/SDM folks to discuss, troubleshoot, and develop new platforms and algorithms, so if you are interested please consider checking out the Discussion section of the working group (https://adapt.nd.edu/groups/cem/overview).
There are two different questions here: one is the software that you use, and the other is the method that you use. If you work in R, you can use it to run any model (black box or not). As you have presences and absences, I would recommend a classic GLM (logistic regression), because its results are easy to discuss and interpret. But you can also try Maxent (via the dismo package), or GAM, or BRT... It's up to you! :-)
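To make the GLM suggestion above concrete, here is a minimal, language-agnostic sketch of what fitting a logistic regression (a GLM with binomial family) to presence/absence records amounts to. The covariate values and 0/1 records below are invented for illustration, and plain gradient ascent stands in for the iteratively reweighted least squares a real GLM routine would use.

```python
import math

def fit_logistic(x, y, lr=0.1, steps=5000):
    """Fit P(presence) = 1 / (1 + exp(-(b0 + b1*x))) by maximising
    the Bernoulli log-likelihood with plain gradient ascent."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += (yi - p)          # gradient of log-likelihood w.r.t. b0
            g1 += (yi - p) * xi     # gradient w.r.t. b1
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Hypothetical environmental covariate (e.g. temperature) and 0/1 records.
temp = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
pres = [0,   0,   0,   1,   0,   1,   1,   1]
b0, b1 = fit_logistic(temp, pres)
```

Since presences cluster at high covariate values, the fitted slope b1 comes out positive and the predicted probability of presence rises with the covariate; in R the same fit is a one-liner with `glm(pres ~ temp, family = binomial)`.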
I am going to echo what others have said. It is actually a two-part question you are asking, and it really depends on what your data look like. First you need to know what kind of modelling approach you would like to use (e.g. GLM, GAM, GLMM, BRT, Random Forests, etc.). I would say that if you feel confident in your 'absence' dataset (that they are TRUE absences), then any of these would be an appropriate method. Each has its strengths and weaknesses, and there have been quite a few papers (many by Elith) comparing these approaches. If you have presence-only data, MaxEnt has been shown to be fairly robust in some comparison studies, but pseudo-absences can be generated for the other techniques as well.
Once you have decided which modelling approach is most beneficial for your dataset and research question, then you can determine the best software package for your needs. R has a steep learning curve, but once you have the basics of the language, the packages you can use are about as easy to use as MaxEnt. Still, it is pretty important to understand the statistics that are going into any of these methods, black box or not, as there is a real potential to 'abuse' the software, which may result in inaccurate inferences.
Maxent could be a good option. Also I recommend you try the BIOMOD platform for ensemble forecasting of species distributions. It is implemented in R and allows simulating distributions with several techniques, testing models with a wide range of approaches, and projecting distributions into different environmental conditions and dispersal functions. Finally, I would suggest you take a look at the dismo package in R.
The last two answers from Rachel and Veronica basically contain what I was going to write down (cf. the modelling techniques and the software to use: dismo and BIOMOD are awesome packages that include all the modelling tools you would need in R for this kind of purpose).
However, I would add a comment about the modelling techniques you could use. Because you have presence/absence data, I would definitely not use Maxent, but GLMs (cf. logistic regressions), GAMs or BRTs instead. Indeed, using Maxent with presence/absence data is a bit of a shame: it would be like throwing a lot of your data (the absences) away. We actually got this criticism from referees for using Maxent with presence-absence data. So, you should know from the outset that using Maxent with presence-absence data will increase your chances of getting this kind of comment from referees ;)
Be aware that MaxEnt has numerous problems with assumptions being made for the modeller by the software if your absences are not true absences (i.e. you failed to sample areas where you did NOT expect to encounter your organism of study). Check out Royle et al. 2012 in Methods in Ecology and Evolution. However, if you have repeated visits in your sample, the best approach is occupancy analysis, where detection probabilities are accounted for and absences figure into said probabilities. See MacKenzie et al.'s 2006 book "Occupancy Estimation and Modeling". The software used for these methods varies, from R packages to WinBUGS and the like.
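The core of the occupancy-modelling idea above can be sketched in a few lines: with repeated visits per site, occupancy (psi) and detection probability (p) are estimated jointly, so an all-zero detection history is no longer forced to mean true absence. The detection histories below are invented for illustration, and a brute-force grid search stands in for the proper likelihood optimiser the dedicated packages use.

```python
import math
from itertools import product

def log_lik(histories, psi, p):
    """Single-season occupancy log-likelihood (MacKenzie-style)."""
    ll = 0.0
    for h in histories:
        k, d = len(h), sum(h)
        if d > 0:
            # At least one detection: the site is certainly occupied.
            ll += math.log(psi) + d * math.log(p) + (k - d) * math.log(1 - p)
        else:
            # All zeros: occupied-but-missed OR truly unoccupied.
            ll += math.log(psi * (1 - p) ** k + (1 - psi))
    return ll

# Hypothetical histories: 6 sites, 3 visits each (1 = detected).
histories = [(1, 0, 1), (0, 1, 0), (0, 0, 0), (1, 1, 0), (0, 0, 0), (1, 0, 0)]
grid = [i / 100 for i in range(1, 100)]
psi_hat, p_hat = max(product(grid, grid),
                     key=lambda t: log_lik(histories, *t))
```

Because detection is imperfect, the estimated occupancy psi_hat ends up above the naive proportion of sites with detections (4/6): the model attributes some all-zero histories to missed occupancy rather than true absence.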
An ensemble modelling approach is better than single-algorithm modelling, so it is better to use a platform that provides an option for ensemble modelling. There are several such packages in the R programming language, such as dismo, biomod2 and BiodiversityR. These packages can be used for robust ensemble modelling, particularly BiodiversityR, which includes 20 algorithms. Good luck!
The dismo R package makes several of the preferred methods (Maxent, BRT...) available, and its vignettes contain helpful code showing much of the necessary data preparation as well as the analysis, along with clear explanation:
dismo: Species distribution modeling
http://cran.r-project.org/web/packages/dismo/
For more background on species distribution modeling, Elith & Leathwick (2009) provides an excellent overview and extensive bibliography:
Species Distribution Models: Ecological Explanation and Prediction Across Space and Time
There's not really a whole heap more to add to all the discussion that has taken place here. Obviously everybody who has weighed into this debate so far has a fairly firm grip on the statistics, processes and packages involved. There's an important point to bear in mind here, though, and that is: what is the question you are asking? By building your analysis on a presence/absence data set you are inherently limited to correlative models (MaxEnt, GLMs, Random Forest), rather than mechanistic models (CLIMEX or NicheMapper). It's really worth coming to grips not just with the statistics going into these models, but with the assumptions underpinning them, and how much your data or hypothesis-testing violates those assumptions. After all, "all models are wrong, but some can be instructive." You want to be certain that your model is the least wrong and most instructive it can be. There are a number of good papers out there, e.g.:
Elith, J., Kearney, M. and Phillips, S. (2010). The art of modelling range-shifting species. Methods Ecol. Evol. 1, 330–342.
Dormann, C. F., Schymanski, S. J., Cabral, J., Chuine, I., Graham, C., Hartig, F., Kearney, M., Morin, X., Römermann, C., Schröder, B., et al. (2011). Correlation and process in species distribution models: bridging a dichotomy. J. Biogeogr. 39, 2119–2131.
Buckley, L. B., Urban, M. C., Angilletta, M. J., Crozier, L. G., Rissler, L. J. and Sears, M. W. (2010). Can mechanism inform species’ distribution models? Ecol. Lett. 13, 1041–1054.
That said, if correlative modelling is all that you've got (which is fair, given how data-hungry mechanistic models are), your next question is: how statistically skilled and experienced in modelling are you? I agree with the statements above that throwing out your absence data is a shame, but using it is fraught unless it is genuine absence data, not "absence of detection" data. After all, a model is only as good as the data on which it was constructed. Further, MaxEnt is an easy, straightforward entry-level package and it is capable of generating relatively accurate predictions, if given relatively accurate data and used with appropriate care to avoid over-fitting and other statistical glitches. As has been said, however, there are more involved approaches in other stats environments, like R, that will provide more involved outcomes.
I guess that my answer comes down to two things: understand your question and understand your algorithms, because in the end any of the modelling approaches suggested can be the most appropriate under the right circumstances. These are the problems I would be most cautious about when approaching any SDM or ENM question, not necessarily which package I could get hold of or use.
What do you want from your modelling? Do you want a map predicting the relative abundance or habitat suitability for your species, or do you want some explanation for the patterns of observations? The answer to this question can strongly affect the suitability of different classes of models: one that is good at achieving one of these is not necessarily the best for the other. Also, what do you want to achieve with your black-box modelling in relation to the modelling you've already done? Are you looking to compare the performance of the methods? If so, then I'd strongly urge you to develop a deep understanding of the meaning of the metrics that you use to evaluate the models. Far too many papers quote kappa, TSS or AUC inappropriately, I presume because the packages provide them as an automatic output and because the authors have not taken the time to understand the behaviour of these metrics. Depending on the origin and meaning of your presence and absence data, your study may be one of the few that can really take advantage of kappa and TSS.
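To see why kappa and TSS behave differently (and why they are so often quoted without understanding), it helps to compute them by hand from a 2x2 confusion matrix. The counts below are invented for illustration; note that kappa depends on prevalence while TSS does not, which is exactly the kind of behaviour worth understanding before quoting either.

```python
def tss(tp, fp, fn, tn):
    """True Skill Statistic = sensitivity + specificity - 1.
    Insensitive to prevalence."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens + spec - 1

def kappa(tp, fp, fn, tn):
    """Cohen's kappa: observed agreement corrected for chance agreement.
    Unlike TSS, it depends on prevalence."""
    n = tp + fp + fn + tn
    po = (tp + tn) / n                                            # observed
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2   # by chance
    return (po - pe) / (1 - pe)

# Invented confusion matrix: 40 true presences, 45 true absences,
# 10 false presences, 5 false absences.
print(round(tss(40, 10, 5, 45), 2))    # -> 0.71
print(round(kappa(40, 10, 5, 45), 2))  # -> 0.7
```

For a perfect model both metrics equal 1, and for a model no better than chance both sit near 0; between those extremes they diverge as prevalence shifts, which is why comparing kappa values across studies with different prevalences is risky.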
I think that many of the comments above implicitly assume that your presence records represent established populations. Depending on the type of organism and its propensity to develop ephemeral populations, you may want to consider exactly how you are framing your model.
I would like to echo the general issues of understanding your question and the algorithms, and as a corollary, generally avoid the use of ensembles.
In addition to what has already been said, I think it's important to consider the level of the modeller and your background (especially regarding scripting). For an entry-level modeller, MaxEnt is good, but as already said, it's a waste to use MaxEnt when you have both presence and (true) absence data; still, it can give you a good start. In your question you asked about "software that is reliable", and I would say the reliability of the software depends on a number of other factors, as already explained. So it's not just a software issue, it's a whole basket!
While I agree with most of the comments, I would like to make a very important point: the software is secondary. The most important thing is the question to be answered. Secondly, presence-absence (site occupancy rate) is functionally related to abundance. If your question is to predict relative abundance from presence-absence data, you first need to establish the occupancy-abundance relationship, which can be done even in Excel if you have counts. If you want some explanation for patterns of observation, you could fit a generalized linear model (e.g. logistic regression) with covariates. That is fine if detection probability is close to 1 (i.e. the species is easy to count and false negatives are reasonably rare) and the data are not serially correlated (longitudinal). However, the problem with presence-absence data is that in many cases they are zero-inflated, i.e. some of the zeros are true and others are false (false negatives). If the data are longitudinal, a large number of true zeros is also possible when the species is absent in some part of the year and under some conditions (time or condition can then be used as predictors). In such cases GLMs and GAMs are not that powerful, because they cannot tell you the degree of zero-inflation or how much of the "absence" is true absence. They also cannot handle serial correlation in longitudinal data. Therefore, a model that can analyse the zero-inflation, account for serial correlation, and estimate the effects of covariates at the same time is more useful; non-linear mixed-effects models are preferred. I have a paper in Ecological Modelling that looks at all of these. Please follow this link: http://www.sciencedirect.com/science/article/pii/S0304380009001963
I did most of my analysis in SAS, and the code is available in my paper in Pedobiologia. Please follow this link: http://www.sciencedirect.com/science/article/pii/S0031405608000073
If you do not have access to these papers, I will send them to you personally.
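The zero-inflation diagnostic mentioned above can be illustrated with a simple check that works in any language (or even Excel): compare the observed share of zeros with the share a plain Poisson model would predict, which is exp(-mean). The count data below are invented for illustration.

```python
import math

# Hypothetical site counts with many zeros (some true, some false negatives).
counts = [0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 3, 0, 0, 4, 0, 2]

mean = sum(counts) / len(counts)
observed_zero = counts.count(0) / len(counts)
expected_zero = math.exp(-mean)   # P(count = 0) under Poisson(mean)
excess = observed_zero - expected_zero

# A clearly positive excess of zeros over the Poisson expectation is a
# warning sign: a GLM without a zero-inflation component will mis-fit,
# and part of the "absence" is likely non-detection.
```

Here the observed zero fraction (11/16 ≈ 0.69) is well above the Poisson expectation (exp(-0.75) ≈ 0.47), which is the pattern that motivates zero-inflated or mixed-effects models rather than a plain GLM/GAM.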
The most important suggestions have already been made: (i) think about your questions first, and whether your data are adequate to answer them; (ii) an ensemble of bad models still remains crap ;-)
This said, a new and potentially interesting method has just been accepted in MEE:
I am really pleasantly surprised by such feedback! Thank you very much to you all for taking the interest and the time to answer my question. I have a lot to read and to do now, so I will be back after processing the information. Once again: THANKS!
I think your problem deals with binary data analysis. SAS will provide a good solution: you can try the PROBIT or LOGISTIC procedure (via the PROC statement).
You can deal with your problem using the dismo package in R. Check cran.r-project.org/web/packages/dismo/vignettes/sdm.pdf: you will find a detailed account of how to handle presence-absence data, with step-by-step R code for processing such data as well as for the species distribution modelling itself. If you want easier ensemble modelling, you can use the BiodiversityR package.
Just a remark on ensemble modelling. You don't have to 'average all models': R packages such as BIOMOD allow you to run several techniques at once while specifying an evaluation threshold to decide which models are included in the ensemble model (which can be the average, the median, or other options as well). If you apply the default values of biomod2, for example, only models with a True Skill Statistic (TSS) > 0.7 are included in the ensemble predictions.
Biomod2 can be found here: https://r-forge.r-project.org/R/?group_id=302 , more information about this ensemble modelling technique here: http://www.will.chez-alice.fr/Software.html
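The threshold-based ensemble logic described above is straightforward to sketch: keep only the models whose evaluation score (here TSS) exceeds a cut-off, then combine their per-cell predictions with a chosen statistic. The model names, TSS scores, and three-cell "prediction maps" below are invented for illustration and do not come from biomod2 itself.

```python
from statistics import mean, median

# Hypothetical candidate models with an evaluation score (TSS) and a
# suitability prediction for three map cells each.
models = {
    "GLM": {"tss": 0.78, "pred": [0.20, 0.55, 0.90]},
    "GAM": {"tss": 0.81, "pred": [0.25, 0.60, 0.85]},
    "RF":  {"tss": 0.62, "pred": [0.10, 0.95, 0.99]},  # below the cut-off
}

def ensemble(models, cutoff=0.7, combine=mean):
    """Combine predictions cell-by-cell, but only across the models
    whose evaluation score clears the threshold."""
    kept = [m["pred"] for m in models.values() if m["tss"] > cutoff]
    return [combine(vals) for vals in zip(*kept)]

pred = ensemble(models)            # averages GLM and GAM only
pred_med = ensemble(models, combine=median)
```

With the 0.7 cut-off the RF model drops out, so the ensemble map reflects only the two better-scoring models; swapping `combine` between `mean` and `median` mirrors the "average, median, or other options" choice mentioned above.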
I agree with Diederik that with ensemble modelling you do not have to average (or take the median or consensus values). However, I would go further and say that there are relatively few cases where ensemble modelling makes sense. It makes sense for some weather forecasting models where the feedback is fairly immediate: you can run your models and check the next day whether they were correct or not. It also makes a lot of sense to run ensembles when you want to set up complicated model treatments to assess the sensitivity or uncertainty in the system. In that case you might set up a full- or partial-factorial set of treatments, and the only "statistic" being assessed is the range, which has no significant assumptions associated with it. It may also be useful to look at a set of ensemble model runs to pick out the unusual, so as to understand the factor that that particular model has picked up.
However, as soon as you start mixing models and applying statistics such as the mean, median, mode etc., I think we are usually stepping outside the bounds of good science. If you recall from Stats 101, these statistics carry a set of assumptions: that the observations have been drawn from the same distribution, obeying the law of central tendency. I can't imagine how one would argue that the various packages in Biomod generate a set of data spanning a statistical distribution. I think the packages represent a set of transformations of a single set of observation data, and I don't think there is any attracting force providing any degree of central tendency in the formulation of the algorithms, beyond a relatively small degree of convergent evolution in the methods. Certainly (with no disrespect to the authors of the ensemble modelling packages), the inclusion of algorithms in the packages is subjective, and probably reflects the ease with which they could be coded in R.
This subjectivity strongly affects the meaning of the results of the ensemble, and compounds the significant errors and biases typically found in distribution data. Now, recalling the "garbage in, garbage out" adage, the most important ability for correlative ecological modellers is to know the smell of garbage (The Economist). It seems to me that running an ensemble and mixing the results only serves to mask the smells. What happens if all or most of the models are badly biased? We know that correlative models are unreliable when applied to novel situations. We know from first principles that such models should be extrapolated with extreme care, and there are several papers demonstrating that they can get things badly wrong when applied, for example, across continents. If we run a suite of correlative models under the same extrapolation conditions and they tend to make the same mistakes, should we conclude that the results are consistently correct, or that the models are consistently wrong? If we do not take the trouble to test the results using independent data, we are naturally inclined to accept the incorrect conclusion. So, at the end of the ensemble exercise, are you any more knowledgeable about the results and the factors shaping them, or just more (over-)confident in potentially biased or incorrect results?
Enrique, if you treat your ecological modelling problem as simply a discrimination problem, as suggested by Nandana, you miss the fact that the data have complicated structural relationships with a set of environmental factors (think about the law of the minimum and the law of tolerance). This is one of those situations where the model that fits the data best (describing the sampled species distribution) may have little generality and no explanatory power.
For presence/absence data I have used MARXAN. It is a program used for reserve selection networks, and in my research group we have used it in some papers like the following:
Gap Analysis and selection of reserves for the threatened flora of eastern Andalusia, a hot spot in the eastern Mediterranean region
Antonio Mendoza-Fernández, Francisco Javier Pérez-García, José Miguel Medina-Cazorla, Fabián Martínez-Hernández, Juan Antonio Garrido-Becerra, Esteban Salmerón Sánchez, Juan Francisco Mota
There are some good books and papers you should consult that will provide you with valuable information in addition to all the comments above.
Araujo and Guisan. 2006. Five (or so) challenges for species distribution modeling. J. of Biogeography. 33:1677-1688.
Austin. 2007. Species distribution models and ecological theory: a critical assessment and some possible new approaches. Ecol. Modelling 200:1-9.
Boakes et al. 2010. Distorted views of biodiversity: spatial and temporal bias in species occurrence data. PLoS Biology 8:1-11.
Boria et al. 2014. Spatial filtering to reduce sampling bias can improve the performance of ecological niche models. Ecol. Modelling 275:73-77.
Drew et al. 2011. Predictive species habitat modelling in landscape ecology. Springer, New York.
Elith and Leathwick 2009. Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution, and Systematics 40:677-697.
Elith et al. 2006. Novel methods improve prediction of species' distributions from occurrence data. Ecography 29:129-151.
Franklin. 2009. Mapping species distributions spatial inference and prediction. Cambridge Univ. Press.
Franklin. 2013. Species distribution models in conservation biogeography: developments and challenges. Diversity and Distributions 19:1217-1223.
Graham et al. 2008. The influence of spatial errors in species occurrence data used in distribution models. J. of Applied Ecology 45:239-247.
Hegel et al. 2010. Current state of the art for statistical modeling of species distributions. In Cushman et al. (editors). Spatial Complexity, Informatics, and Wildlife Conservation. Springer.
Kramer et al. 2013. The importance of correcting for sampling bias in MaxEnt species distribution models. Diversity and Distributions 19:1366-1379.
Wisz et al. 2008. Effects of sample size on the performance of species distribution models. Diversity and Distributions 14:763-773.
There is a lot of valuable information in these papers. Good luck with your project.
Your question certainly generated a very interesting conversation! If you're interested in a 'black box' approach you could try openModeller (http://openmodeller.sourceforge.net/), which combines a number of modelling approaches in one framework. But don't forget to look into the issues that were mentioned by the many interesting contributions here.
I strongly agree with both Rachel and Bruce, and many others, I am sure. It depends on what your data look like, and it depends on the question you want to answer. The scary part of SDMs is that, to a manager, a really bad model can look just as nice as a good one: a colourful map is always very popular, and the probabilities and uncertainties are often difficult to communicate. I personally prefer GLMs and GAMs if you have a good sampling design with true absences, and applying these methods in R is quite user-friendly, if you work in R, that is.
When it comes to black boxes, Maxent is a commonly used program, and it performs well when compared to other methods. However, it is scary how often the Maxent program is used in "neck-down" analyses, which I believe is common: you put everything in there, keep all the default settings, just hit the "run" button, and it looks great. But you should know that Maxent creates all kinds of interactions between your input variables, without regard to the ecological relevance of those interactions, and without presenting them to you unless you look in the log file. It also creates pseudo-absences, 10,000 by default, and these are not true absences. So you need to be careful.
I think the question of what you want to answer, and what you want to use your model for, is very important. Do you want to understand (and visualize) general patterns? Do you want to develop the best possible map of your area? Do you want to transfer your model to a different area or a different time? Do you want to improve your sampling design? I would say that there are three different purposes: 1) ecological response modelling, in order to model relationships and find the general patterns in the overall ecological response; 2) spatial prediction modelling, in order to optimize the fit; and 3) projective distribution modelling, in order to transfer your prediction to a different spatiotemporal setting.