I am trying to concatenate 8 different genes from 47 different organisms and prepare a phylogenetic tree. Can someone please suggest a good software to handle this data?
There are a few steps you should not forget when doing a phylogenetic analysis.
1) Align your sequences. The best software at the moment is MAFFT; the newest is Clustal Omega, but no impartial benchmark has yet compared the two. I suggest sticking with MAFFT, whose quality has been proven by years of use.
2) Clean badly aligned regions out of your alignment using GBlocks or trimAl (both are good; comparing them is a long-running debate). The quality of your alignment (and therefore of your phylogeny) will increase, and the computing time will decrease. This is particularly important for a dataset like yours.
3) Produce a maximum-likelihood phylogeny. Since your data can be handled by ML methods, you shouldn't even consider parsimony or distance methods. You can use PhyML or RAxML; the two are very close in quality, but based on existing benchmarks, choose PhyML for nucleotide data and RAxML for protein data.
To display your tree, I suggest Dendroscope, which is complete and fast.
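Since the original question is about concatenating 8 genes from 47 organisms, here is a minimal sketch of the supermatrix step that sits between alignment and tree building: each gene is aligned separately, then the aligned sequences are joined per organism, padding with gaps where an organism is missing from a gene. The organism and gene names are invented toy data; a real analysis would read the per-gene alignments from files.

```python
# Sketch: concatenate per-gene alignments into one supermatrix.
# Assumes each gene was aligned separately and records are keyed by
# the same organism name across genes. Toy data below is hypothetical.

def concatenate(gene_alignments, organisms):
    """Join one aligned sequence per organism across all genes.

    gene_alignments: list of dicts {organism: aligned_sequence};
    organisms missing from a gene are padded with gaps so that all
    concatenated rows stay the same length and columns stay in register.
    """
    supermatrix = {org: [] for org in organisms}
    for aln in gene_alignments:
        width = len(next(iter(aln.values())))  # alignment width of this gene
        for org in organisms:
            supermatrix[org].append(aln.get(org, "-" * width))
    return {org: "".join(parts) for org, parts in supermatrix.items()}

# Toy example: two genes, three organisms (sp_C is missing from gene 2).
gene1 = {"sp_A": "ATG-CC", "sp_B": "ATGACC", "sp_C": "ATG-CA"}
gene2 = {"sp_A": "TTAA", "sp_B": "TTGA"}
matrix = concatenate([gene1, gene2], ["sp_A", "sp_B", "sp_C"])
print(matrix["sp_C"])  # gaps pad the gene that sp_C lacks
```

Keeping track of where each gene starts and ends in the supermatrix is what makes per-gene partitioning (discussed below in this thread) possible.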
I would suggest MAFFT. You can use it to align the sequences and then produce a phylogenetic tree. It has plenty of options to refine your task. Take a look at:
For analyses other than likelihood, e.g. parsimony, try TNT (Tree analysis using New Technology). This program is designed for large data sets and is also freely available at http://www.zmuc.dk/public/phylogeny/TNT/.
To be honest, 47 organisms and even several to many KB of sequence is not a big data set in terms of computational limitations. Many phylogenetic software packages can easily handle hundreds of organisms and many KB of sequence, as long as the computer itself has sufficient resources (RAM, temp file space) to handle it.
The limitation simply comes down to the complexity of the computational algorithm you wish to use, and how long you are willing to wait for a result. Maximum likelihood takes among the longest times to compute simply because it is one of the most computationally demanding methods. Parsimony would be much faster, since there are well-known heuristic algorithms for parsimony inference (an exhaustive parsimony search would take much longer). Distance-based methods would easily be the fastest, as they are the simplest to compute (unfortunately, they also rely on the simplest evolutionary models, so they are less robust for inferring phylogeny than the other methods).
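To illustrate why distance methods are the cheapest to compute: the whole input to such a method is a matrix of pairwise distances, and the simplest of these, the uncorrected p-distance, is just the fraction of aligned sites at which two sequences differ. A minimal sketch, using invented toy sequences:

```python
# Sketch: the uncorrected p-distance underlying the simplest distance
# methods -- the fraction of compared sites that differ between two
# aligned sequences (gapped sites are skipped). Toy sequences are made up.

def p_distance(a, b):
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    diffs = sum(1 for x, y in pairs if x != y)
    return diffs / len(pairs)

seqs = {"duck": "ATGCCGTA", "sparrow": "ATGACGTA", "eagle": "TTGACGAA"}
names = list(seqs)
dist = {(i, j): p_distance(seqs[i], seqs[j])
        for i in names for j in names if i < j}
print(dist[("duck", "sparrow")])  # 0.125: one of eight sites differs
```

Each pairwise distance is a single linear pass over the alignment, which is why building the matrix is fast even for hundreds of taxa; the trade-off is that a single number per pair discards most of the site-level information that likelihood methods exploit.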
Any of the programs in PHYLIP would readily handle your dataset, as would any of the other programs I am familiar with. So I would not let software drive your choice - pick the method of phylogenetic inference you think best, then find software that implements it. The mere size of your data set is not a limitation to computing trees.
Whether or not you want to concatenate the genes together into one run depends a LOT on what types of organisms and what types of genes you are analyzing. Nuclear and/or mitochondrial genes from closely related diploid organisms (plants, animals, fungi etc) is a different story than bacterial genes or genes from very distantly related eukaryotic species. We expect recombination between various lineages (subspecies, breeds) of the same species for example, but if you are looking at ducks, sparrows and eagles they have been reproductively isolated for so long that recombination is less of an issue.
If you do want to concatenate the genes into a single file for one analysis and one resulting phylogenetic tree, you can still use "data partitioning" methods to treat each gene differently (critical if you have some mitochondrial genes and some nuclear), and you may also want to consider partitioning the data in other ways (first, second and third codon positions, for example). The key to doing this type of work is the input file format. PAUP and many other programs use the NEXUS file format:
http://paup.csit.fsu.edu/nfiles.html
which allows you to store a lot of parameters for how to analyze the sequences along with the sequences.
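As a concrete illustration of that, here is a minimal NEXUS sketch of a sets block for a concatenated matrix. The gene names, lengths, and coordinates are hypothetical; the point is the charset/charpartition syntax, which lets one file describe per-gene and per-codon-position partitions alongside the sequences.

```
#NEXUS
begin data;
  dimensions ntax=47 nchar=1500;
  format datatype=dna gap=- missing=?;
  matrix
    [sequence rows go here]
  ;
end;

begin sets;
  charset gene1 = 1-800;        [hypothetical coordinates]
  charset gene2 = 801-1500;
  charset gene1_pos3 = 3-800\3; [every third site: codon position 3]
  charpartition by_gene = g1:gene1, g2:gene2;
end;
```

Programs that understand the sets block can then assign a different substitution model or rate to each partition in a single combined analysis.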
Hwang and Green discuss partitioning sites by rate of evolution, rather than by gene or codon position. They show that the context of each site (such as a C followed by a G, which makes the C a target of DNA methylation enzymes) is important.
Dick G. Hwang and Phil Green. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. PNAS 101(39):13994–14001, 2004.
The dataset is available at http://www.nisc.nih.gov/data.
But the route you take should be driven by the biology of the question(s) you are asking. Studying 47 species of salamanders that all shared a common ancestor just before the last ice age, and have been reproductively separated by hundreds of miles only after the ice retreated, is very different from studying birds that shared a common ancestor more than 100 million years ago. Studying bacteria or viruses brings up other sets of differences.
That looks very nice indeed as an explanation of much of the biology, but it does not explain very much about the methods: data partitioning, NEXUS files, and so on. And nothing in biology is "one size fits all". Coalescent issues can be huge in species that rather recently shared a common ancestor (human, chimpanzee, gorilla, Neanderthal, Denisova) but far less important for more distantly related taxa (human, macaque, dog, whale, seal, aardvark).
One step that could be added to Jean-François' advice is to use DAMBE or another tool to check a few aspects of the quality of the data, and perhaps also SimPlot or SplitsTree or another tool that can evaluate how much recombination, horizontal gene transfer, or related processes have contributed to non-tree-like evolution.
In DAMBE (which also includes ML, NJ, and many other tree-building methods, plus the ability to select subsets of your data such as first, second, and/or third codon positions), the Graphics pull-down menu has a "plot transitions and transversions vs nucleotide distances" option, which produces plots such as the ones I have attached here from complete vertebrate mitochondrial genomes, http://figshare.com/articles/Vertebrate_Mitochondrial_DNA_Phylogeny/692165 with data from
My point with this example is not that either the data or the method is "wrong", but that the ability to resolve some branches in the tree decreases as the distances increase. Genes in the nucleus (along with fossils and other data) can be used to infer when the mammals split from a common ancestor relative to when birds, reptiles, and amphibians diverged. But mitochondrial DNA, whether complete mt genomes or a cleaned-up subset of that data, loses the power to make accurate inferences because the data is beyond saturation with mutations; mitochondria evolve roughly 10-fold faster than genes in the nucleus. Mitochondrial DNA is great for genus/species-level comparisons, but not so good for class/order/family comparisons.
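The quantities behind that DAMBE saturation plot are simple to compute: for each pair of sequences, count transitions (A<->G, C<->T) and transversions (all other substitutions) and plot both against distance. A minimal sketch of the counting step, with invented toy sequences:

```python
# Sketch: count transitions (A<->G, C<->T) and transversions between a
# pair of aligned sequences -- the per-pair quantities DAMBE plots
# against nucleotide distance. Toy sequences below are made up.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def ts_tv(a, b):
    ts = tv = 0
    for x, y in zip(a, b):
        if x == y or "-" in (x, y):
            continue  # skip identical and gapped sites
        same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
        ts += same_class       # A<->G or C<->T: transition
        tv += not same_class   # purine<->pyrimidine: transversion
    return ts, tv

print(ts_tv("ATGCGT", "GTACGA"))  # (2, 1)
```

In a saturation plot, observed transitions plateau with increasing distance (repeated hits overwrite each other) while transversions keep accumulating; when the transition curve flattens, the data has little signal left for the deepest branches.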
Many people who study vertebrate evolution will find my example here to be a bit of a "no brainer". But people who study bacteria or viruses often fail to realize just how vast the DNA distances are in their data sets, and how those distances can complicate the inference of ancient events. Putting the bacterial or viral data alongside data from a set with a more solid fossil record, helps to put it in perspective.
In my experience, aligning sequences with MAFFT or even CLUSTAL is fairly trivial, and there are few regions which need to be cleaned up or removed, when the distances are in a reasonable range. When methods beyond MAFFT are needed, or the alignment has many gappy regions (my vertebrate mitochondrial genome alignment is freely downloadable from the figshare link above) it is usually an indication that the genetic distances are too large to get accurate results without heroic efforts.
Well, in my experience, it is quite common to have very long, gappy alignments that still contain some conserved blocks. After cleaning with trimAl or GBlocks, the result can be very well suited to phylogenetic reconstruction. The "alignment -> cleaning -> phylogeny (-> reconciliation)" workflow is very standard, both for populating full-genome family databases and for carefully studying particular families. And the gap proportion of the raw MAFFT alignment is very frequently higher than one might expect.
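To make the cleaning step concrete, here is a crude stand-in for what trimAl or GBlocks do: drop alignment columns whose gap fraction exceeds a threshold. The real tools use additional criteria (conservation, minimum block length, flank quality); this sketch, with invented toy data, shows only the gap-fraction idea.

```python
# Sketch: a crude stand-in for trimAl/GBlocks -- drop alignment columns
# whose gap fraction exceeds a threshold. Real trimmers apply further
# criteria (conservation, block length). Toy alignment is hypothetical.

def trim_gappy_columns(alignment, max_gap_frac=0.5):
    names = list(alignment)
    rows = [alignment[n] for n in names]
    n_rows = len(rows)
    keep = [i for i in range(len(rows[0]))
            if sum(r[i] == "-" for r in rows) / n_rows <= max_gap_frac]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}

aln = {"sp1": "AT---G", "sp2": "ATC--G", "sp3": "AT---G", "sp4": "ATCA-G"}
trimmed = trim_gappy_columns(aln)
print(trimmed["sp1"])  # columns 4 and 5 (mostly/all gaps) are dropped
```

Because trimming removes the same columns from every row, the result is still a valid alignment, just shorter, which is exactly why it also reduces downstream computing time.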
A great deal of eukaryotic and prokaryotic evolution is due to insertions and deletions, which result in regions of DNA that are not homologous. So Jean-François is absolutely correct that in many organisms homologous and non-homologous regions are interspersed. Mitochondrial DNA gives transposons, endogenous retroviruses, LINEs, SINEs, and other forms of "selfish DNA" less chance to insert. I was not suggesting that cleaning up an alignment is never needed, only that if the alignment problem is very difficult, it can be an indication that the regions being aligned are quite distant from each other, and perhaps saturated with mutations. Opening the alignment in DAMBE (before and/or after the clean-up) and clicking a button to make a graph is worth doing.