We are dealing with Capsicum and soon also with sunflower transcriptomes. I am not up to date on the newest assemblers. We will try Trinity and Newbler (if I can get it to run on my Mac!). Does anyone have suggestions?
Thank you Vladimir; in our case we will be using both a previous assembly of 454 reads and new Illumina (paired-end) reads. We will try MIRA and post the results.
My personal experience is that Newbler (and not Trinity) works really well for 454 reads, while Trinity does a good job with Illumina ones. Mixing the two read types does not improve the results with Trinity.
While it is pricey and only moderately tweakable, we have had really good luck with CLC Genomics Workbench. It has worked well assembling Illumina, 454, and Ion Torrent sequence data as discrete or hybrid assemblies. We have compared its assembly results to those of other assemblers; it is as good as MIRA and better than the others. It is also VERY fast. Cost is the biggest drawback.
For hybrid assembly (454 & Illumina single-end), we assemble the 454 reads with MIRA and the Illumina reads with Velvet/Oases. We perform several Velvet/Oases runs over a range of k-mer values and add the cleaned 454 reads to each run to improve the assembly. We then merge the 454 MIRA assembly and all the Velvet/Oases assemblies into a unified super-assembly with CAP3.
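In case it helps, here is a minimal bash sketch of that pipeline. File names (illumina.fq, 454_clean.fq, and a finished MIRA 454 assembly in mira_454_contigs.fa) are hypothetical, and the Velvet/Oases/CAP3 options are just illustrative; check your versions' documentation.

    # one Velvet/Oases run per k-mer value, feeding the cleaned 454
    # reads in as long reads alongside the Illumina short reads
    for k in 21 25 29 33; do
        velveth oases_k$k $k -fastq -short illumina.fq -long 454_clean.fq
        velvetg oases_k$k -read_trkg yes   # read tracking is required by Oases
        oases oases_k$k                    # writes oases_k$k/transcripts.fa
    done
    # pool the MIRA assembly with every Oases run and merge with CAP3
    cat mira_454_contigs.fa oases_k*/transcripts.fa > pooled.fa
    cap3 pooled.fa > cap3.log   # merged contigs end up in pooled.fa.cap.contigs

Running several k values and only then collapsing everything with CAP3 is what produces the "super-assembly" mentioned above.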
Thank you very much to all the colleagues who answered; we will try different strategies and then post a brief summary. Our first attempts will be on an iMac with a 3.4 GHz Intel Core i7 and 16 GB of 1333 MHz DDR3 RAM; if that is not enough, we will use Langebio's cluster. Is anybody using similar hardware (an iMac)? I am having problems running Newbler…
I think it will depend on the tools chosen and the amount of sequence, but my guess is that RAM will most likely be the limiting factor, and 16 GB may be too little, depending on the depth.
An important consideration for de novo assembly is pre-filtering the reads before attempting an assembly.
Reads containing errors can be identified because they tend to be singletons. Duplicate reads can also be eliminated, since they add no information for the assembler.
The khmer tool and digital normalization approach developed by C. Titus Brown perform these tasks very well.
Links for paper and code here:
http://ged.msu.edu/papers/2012-diginorm/
Applying these pre-filters helps to limit memory usage and reduces the redundancy in the final transcriptome assembly.
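For what it's worth, a rough sketch of what this looks like on the command line with khmer (the script names are from khmer's CLI, but exact flags and any hash-table sizing options differ between khmer versions; reads.fq is a placeholder):

    # cap per-locus coverage at ~20x; highly redundant/duplicate reads are
    # discarded, and -s saves the k-mer counting table for the next step
    normalize-by-median.py -k 20 -C 20 -s counts.ct reads.fq
    # drop low-abundance k-mers, i.e. the likely sequencing errors that show
    # up as singletons; input is the .keep file written by the step above
    filter-abund.py counts.ct reads.fq.keep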
Btw, MIRA comes with the miramem tool, which roughly estimates the amount of RAM needed for assembly. You will just have to answer a few questions (type of reads, their number, etc.)
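For example (the prompts are paraphrased from memory, so take them loosely):

    $ miramem
    # ...then answer the interactive questions (sequencing technology,
    # number of reads, average read length, and so on) and it prints a
    # rough RAM estimate for the assembly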
I second Olivier Armant's recommendation of Oases/Velvet, but I have also heard that CLC bio's assembly pipeline is REALLY easy and works pretty well.
Any of these assemblers, though, will require lots of memory. If this is a huge data set (e.g., an Illumina run), you can significantly reduce the memory requirements for any of them using the "digital normalization" technique, implemented in a Python software package called khmer. There's a tutorial here:
and a paper discussing the theory and practical results of this diginorm pre-processing step here:
http://arxiv.org/abs/1203.4802
We've had some huge successes using this pre-processing step prior to assembling several Illumina HiSeq lanes combined into one data set (~300-400 million reads, 100 bp each) both for metagenomics and transcriptomics.
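To give an idea of what that can look like for paired HiSeq lanes (the lane file names are hypothetical, and some flag names have shifted between khmer versions):

    # interleave each paired-end lane, then normalize all lanes together
    for lane in lane1 lane2 lane3; do
        interleave-reads.py ${lane}_R1.fastq ${lane}_R2.fastq > ${lane}.pe.fq
    done
    # -p keeps read pairs together during normalization; kept reads are
    # written to *.keep files, which then go into the assembler
    normalize-by-median.py -p -k 20 -C 20 lane*.pe.fq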
To add my two cents... your hardware makes things really tricky. Try to get access to a computer with more RAM; how much you need depends on the amount of reads and the technology used, but still. We have run a dozen or so independent transcriptome assemblies here in the last few years, mostly on Illumina data. Oases/Velvet is pretty good, and so is Trinity. Our mainstay is CLC, especially since version 5.5 - it usually yields the best results, is the fastest, and uses the least resources. If I remember correctly there is a trial version available, so if you only have this one dataset you can give it a shot without buying the (expensive) software.

But the best advice? Try several assemblers. There is no "one solution for all" software, at least not yet. Assemble with various programs, using several options, then assess the quality using length distribution plots and by randomly checking a few dozen contigs, both by BLAST and by looking at the distribution of reads within them. This approach will cost a few days or weeks of time, but you will save tremendously downstream.
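On the length-distribution point: you can pull contig lengths and an N50 out of any assembly FASTA with a couple of awk one-liners (contigs.fa is a placeholder for whichever assembler's output you are checking):

    # contig lengths, sorted longest-first (handles multi-line FASTA records)
    awk '/^>/{if(len)print len; len=0; next}{len+=length($0)}END{if(len)print len}' \
        contigs.fa | sort -rn > lengths.txt
    # N50: the length of the contig at which the running total first
    # passes half of the total assembly size
    awk '{tot+=$1; l[NR]=$1}
         END{half=tot/2; run=0;
             for(i=1;i<=NR;i++){run+=l[i]; if(run>=half){print "N50:", l[i]; exit}}}' lengths.txt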
For prokaryote genomes we have used Velvet/Oases, and it is really good! As Paul McGettigan says, applying a filter (quality filtering and trimming the sequences) is very important. For quality filtering of Roche 454 and Ion Torrent reads I recommend https://sourceforge.net/projects/qualevaluato/files/Quality%20Long%20Reads/