How to screen genomes for compositional studies?

Philipp H. Schiffer

Hi there!

I am not sure if GC and length are really the best measurements to pick. What about aligning the similar genomes and then looking at genetic diversity? Look for coding SNPs and pick all the genomes that show possible functional divergence.

Alessandro Giuliani

I hope this work of ours, in which we used principal component analysis of SNPs to compare the genomes of several mouse strains, could be of use to you:

http://www.la-press.com/a-novel-multi-scale-modeling-approach-to-infer-whole-genome-divergence-article-a3417-abstract

Marcin Golebiewski

Hi,

First you should determine what your "taxonomic resolution" is going to be: whether you want to have wide groups, such as phyla, or narrow ones (genera, species, strains). Your strategy will depend on that.

If you want to stick to deep-level, narrow groups, I think that Philipp Schiffer's answer is highly relevant.

If you want broader groups, you should use some synthetic measure of similarity, such as the average (over whole genomes) similarity of the protein sequences encoded by the genomes. The expression for this could look like this:

S = ((sum of similarities/number of genes in genome1)+(sum of similarities/number of genes in genome2))/2
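A minimal Python sketch of this expression, under one possible reading of it: each "sum of similarities" is taken as the sum of best-hit similarities of one genome's proteins against the other genome, and genes without a hit contribute zero. The function names and the toy numbers are hypothetical and only illustrate the arithmetic.

```python
def mean_similarity(best_hits, n_genes):
    """Sum of per-gene best-hit similarities divided by the total gene count.

    best_hits maps gene id -> similarity (e.g. percent identity) of its best
    hit in the other genome. Genes without a hit are simply absent, so they
    count as zero similarity (an assumption, not stated in the thread).
    """
    if n_genes <= 0:
        raise ValueError("n_genes must be positive")
    return sum(best_hits.values()) / n_genes


def symmetric_similarity(hits_1_vs_2, n_genes_1, hits_2_vs_1, n_genes_2):
    """Average the two directional means so that S(1, 2) == S(2, 1)."""
    s1 = mean_similarity(hits_1_vs_2, n_genes_1)
    s2 = mean_similarity(hits_2_vs_1, n_genes_2)
    return (s1 + s2) / 2


# Toy data: genome A has 4 genes (2 with hits), genome B has 2 genes.
hits_a_vs_b = {"a1": 90.0, "a2": 80.0}
hits_b_vs_a = {"b1": 90.0, "b2": 80.0}
score = symmetric_similarity(hits_a_vs_b, 4, hits_b_vs_a, 2)
print(score)  # (170/4 + 170/2) / 2 = 63.75
```

Dividing by the total gene count rather than by the number of hits is what lets differences in gene content lower the score.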

Alternatively, you could use megablast similarity combined with coverage, but you have to average it over the two genomes:

S = (similarity * coverage w/respect to genome1 + similarity * coverage w/respect to genome2)/2

where similarity is averaged over all hits and coverage is summed over all hits.
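The megablast variant can be sketched the same way. This assumes the hits have already been parsed into (identity, aligned length on genome1, aligned length on genome2) records; the `Hit` type is hypothetical. Similarity is a plain mean over hits, and coverage is the summed aligned length divided by genome length. Capping coverage at 1.0 is my own addition to guard against overlapping hits and is not part of the expression above.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    identity: float  # fraction of identical positions in the hit, 0..1
    len_1: int       # aligned length on genome1
    len_2: int       # aligned length on genome2


def megablast_score(hits, genome_len_1, genome_len_2):
    """S = (similarity * coverage_1 + similarity * coverage_2) / 2."""
    if not hits:
        return 0.0
    # Similarity is averaged over all hits (unweighted mean).
    similarity = sum(h.identity for h in hits) / len(hits)
    # Coverage is summed over all hits; overlapping hits would overcount,
    # hence the cap at 1.0 (an assumption of this sketch).
    cov_1 = min(1.0, sum(h.len_1 for h in hits) / genome_len_1)
    cov_2 = min(1.0, sum(h.len_2 for h in hits) / genome_len_2)
    return (similarity * cov_1 + similarity * cov_2) / 2


hits = [Hit(0.9, 500, 500), Hit(0.8, 300, 300)]
s = megablast_score(hits, genome_len_1=1000, genome_len_2=2000)
print(round(s, 4))  # 0.85 * (0.8 + 0.4) / 2 = 0.51
```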

The reason for this approach is that for highly divergent genomes the number of SNPs stops being a usable metric. Basically, it works well up to ~3% divergence, roughly the level of a bacterial "species".
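To check which side of that ~3% line a pair of genomes falls on, one rough option is to count mismatches over the aligned, unambiguous columns of a pairwise alignment. The sketch below assumes two equal-length aligned strings with `-` for gaps; the function name and the way the threshold is applied are illustrative only.

```python
def snp_divergence(aligned_a, aligned_b):
    """Fraction of differing positions among columns where both sequences
    have an unambiguous base (gaps and N-like symbols are skipped)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    diffs = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                diffs += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return diffs / compared


SNP_USABLE_LIMIT = 0.03  # the ~3% figure from the post above

d = snp_divergence("ACGT-CGT", "ACGTACGA")
print(d, "SNP-based" if d <= SNP_USABLE_LIMIT else "use an averaged similarity")
```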

Clifford G Clark

Another possibility would be to make two dendrograms, one based on the core genome and the second based on the entire genome with accessory genes included. Based on the dendrograms, decide what level of relatedness will give you a number of genomes that you consider reasonable for further analysis, given your resources and time, and choose genomes that appear to be representative of the key groupings. Comparing the phylogenies obtained from core genomes with those that include accessory genes may help you to identify strains or isolates that have atypical accessory gene content, have higher than average horizontal gene transfer, or otherwise look interesting.
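As a starting point for the second dendrogram, the core/accessory split and pairwise distances on accessory content can be computed from gene presence/absence alone. This is only a sketch: it assumes genes have already been clustered into shared ortholog identifiers, uses Jaccard distance as one possible choice, and stops at the distance table, which would then go into whatever clustering or tree-building tool you prefer. All names are hypothetical.

```python
from itertools import combinations


def split_core_accessory(gene_sets):
    """Core = genes present in every genome; accessory = everything else."""
    sets = list(gene_sets.values())
    if not sets:
        return set(), set()
    core = set.intersection(*sets)
    pan = set.union(*sets)
    return core, pan - core


def jaccard_distance(a, b):
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def accessory_distances(gene_sets):
    """Pairwise Jaccard distances computed on accessory genes only."""
    _, accessory = split_core_accessory(gene_sets)
    acc = {name: genes & accessory for name, genes in gene_sets.items()}
    return {
        (x, y): jaccard_distance(acc[x], acc[y])
        for x, y in combinations(sorted(acc), 2)
    }


genomes = {
    "s1": {"a", "b", "c"},
    "s2": {"a", "b", "d"},
    "s3": {"a", "b", "c", "e"},
}
core, accessory = split_core_accessory(genomes)
dist = accessory_distances(genomes)
```

Isolates that sit close together on a core-genome tree but far apart in this table are the ones with atypical accessory content.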

Glenn Soltes

I would wonder why you would want to remove any genomes, but if you do need to, I would look at how things actually happen in evolution.

For a given group of genomes that are (sometimes arbitrarily) grouped into a species there generally aren't huge differences in overall genome sequence (if they are actually grouped realistically) so there will be no overall differences in GC content. However, within a group, individual isolates may have picked up plasmids or integrated stretches of DNA that may or may not have a divergent GC content. Some call these islands and they can be very interesting, but you could probably rationally exclude genomes that have differences only in these islands. Unfortunately these islands can cause enough phenotypic variation that isolates may be grouped as a different species even though the genomes are almost identical otherwise.
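A crude way to spot candidate islands is to slide a window along the genome and flag windows whose GC fraction departs from the genome-wide value. The window size, step and deviation threshold below are arbitrary illustrative defaults, not recommended settings, and dedicated island predictors use more than GC content.

```python
def gc_fraction(seq):
    """GC fraction over unambiguous bases; 0.0 if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def divergent_windows(seq, window=1000, step=500, max_delta=0.10):
    """Return (start, end, gc) for windows whose GC fraction differs from
    the whole-sequence GC fraction by more than max_delta."""
    genome_gc = gc_fraction(seq)
    flagged = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        gc = gc_fraction(chunk)
        if abs(gc - genome_gc) > max_delta:
            flagged.append((start, start + len(chunk), gc))
    return flagged


# Toy sequence: AT-rich background with a 20 bp GC-rich insert at position 200.
toy = "AT" * 100 + "GC" * 10 + "AT" * 100
islands = divergent_windows(toy, window=20, step=20)
print(islands)  # [(200, 220, 1.0)]
```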

The take-home message is that for each "species" you need to choose a reference genome, do sequential and group whole-genome alignments to see how the genomes actually differ, and then make rational guesses based on the inferred evolutionary relationships of the genomes. This is very labour intensive. Then exclude.

What you are really doing is trying to model the evolutionary history of every species in your study group, which is not an easy task.

For many groups, species is a fairly undefined concept. You might want to pretend it doesn't even exist, model every genome as an independent entity, and see how it comes out.

Vasilis J Promponas

Hi all,

Very interesting question, and nice posts.

Just a short comment @Glenn Soltes:

"For a given group of genomes that are (sometimes arbitrarily) grouped into a species there generally aren't huge differences in overall genome sequence (if they are actually grouped realistically) ..."

This may hold in general for %GC content, but variations in other measures, e.g. gene content, may be important and should be considered. Take for example E. coli strains, where several complete/near-complete genomes are publicly available: protein coding gene numbers vary from ~4200 to over 5000 ...

The way to choose a reference genome depends on the biological question asked and the resolution needed (cf. Marcin Golebiewski's comment). Taking into account that intergenic regions in prokaryotic genomes tend to be really short, I would rather rely on a measure related to coding regions/gene content in order to rationally pick a representative. For example, I would avoid choosing a strain if there was evidence that many of its genes were acquired by Horizontal Gene Transfer, unless of course HGT is relevant to my biological question.
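One way to turn "pick a representative by gene content" into something computable is to take the genome whose gene set is, on average, closest to all the others (a medoid), after dropping strains you have reason to distrust, such as those with heavy HGT. This is an assumption-laden sketch rather than an established procedure: the strain names and gene ids are made up, Jaccard distance is just one possible measure, and the exclusion list has to come from a separate analysis.

```python
def jaccard_distance(a, b):
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


def pick_representative(gene_sets, exclude=()):
    """Return the non-excluded genome with the smallest mean Jaccard
    distance to every other genome in the group."""
    candidates = [name for name in sorted(gene_sets) if name not in exclude]
    if not candidates:
        raise ValueError("no candidate genomes left")

    def mean_distance(name):
        others = [o for o in gene_sets if o != name]
        if not others:
            return 0.0
        total = sum(jaccard_distance(gene_sets[name], gene_sets[o]) for o in others)
        return total / len(others)

    return min(candidates, key=mean_distance)


strains = {
    "strain_a": {"g1", "g2", "g3", "g4"},
    "strain_b": {"g1", "g2", "g3", "g5", "g6", "g7"},
    "strain_c": {"g1", "g2", "g3", "g4", "g8"},
}
best = pick_representative(strains)
fallback = pick_representative(strains, exclude={"strain_a"})
```

Excluded strains still count when distances are averaged, so the representative is central to the whole group rather than only to the shortlist.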