In WGS alignment, how much more "unique" is a 500bp read than a 100bp read?

Paul F. Cliften Popular answer

If the human genome were random, it would be easy to work this out with probability. In fact, if the human genome were random, a read of only about 17 bp would typically be unique. But the genome isn't random; it contains many duplicated genes, repetitive elements and simple sequence repeats. Approximately 45% of the genome is repeat sequence.
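The 17 bp figure can be checked with a back-of-the-envelope calculation. The sketch below is only an illustration under stated assumptions: a haploid genome of about 3.1 Gbp, reads that can come from either strand (so twice as many possible start positions), and a uniformly random sequence. The function names are made up for this example.

```python
def min_unique_k(positions: int) -> int:
    """Smallest k for which the number of possible k-mers (4**k)
    exceeds the number of start positions in the sequence."""
    k = 1
    while 4 ** k <= positions:
        k += 1
    return k


def expected_chance_hits(k: int, positions: int) -> float:
    """Expected number of positions that match a given k-mer purely by
    chance, assuming a uniformly random sequence."""
    return positions / 4 ** k


# Assumptions (not from the thread): ~3.1 Gbp haploid genome, both strands.
GENOME_BP = 3_100_000_000
POSITIONS = 2 * GENOME_BP

print(min_unique_k(POSITIONS))  # 17
for k in (17, 20, 25, 30):
    print(k, expected_chance_hits(k, POSITIONS))
```

Under this random model the expected number of chance matches falls by a factor of four with every extra base, so length stops mattering very quickly. The real genome behaves differently only because of its repeat content.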

I've linked an interesting BMC Bioinformatics article that shows 96-98% of 100 bp reads are unique (depending on how you look at the data) while about 98.5% of 500 bp reads are unique.

http://www.biomedcentral.com/1471-2105/15/2

Jason Peter Ross

The best way is to write a quick script and get an empirical measure. I did this years ago for much smaller values of k with the hg18 human genome. Essentially, above 50 nt or so, the great majority of k-mers are unique, and most of the non-unique ones are repeats. From there, uniqueness approaches the 100% mark slowly, in a rather asymptotic fashion.
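A quick script of the kind described above might look like the following. This is a minimal sketch, not the script used for hg18: it counts k-mers on the forward strand only, ignores reverse complements and N bases, and holds every k-mer in memory, so it is only practical for small sequences. The function name and the toy sequence are made up for illustration.

```python
from collections import Counter


def unique_fraction(seq: str, k: int) -> float:
    """Fraction of k-mer start positions whose k-mer occurs exactly once
    in seq (forward strand only; a simplifying assumption)."""
    seq = seq.upper()
    n_positions = len(seq) - k + 1
    if n_positions <= 0:
        raise ValueError("k must not exceed the sequence length")
    counts = Counter(seq[i:i + k] for i in range(n_positions))
    # A k-mer seen once contributes exactly one unique position.
    unique_positions = sum(1 for c in counts.values() if c == 1)
    return unique_positions / n_positions


# Toy sequence with a short internal repeat (ACGT appears twice).
toy = "ACGTACGTTT"
for k in (2, 4, 8):
    print(k, unique_fraction(toy, k))
```

For a whole genome the same measurement would need a dedicated k-mer counter or a sampling approach, since a plain in-memory dictionary of every 100 to 500 bp window is far too large.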

Expect 500 bp reads to uniquely place far more Alu and LINE repeats and satellite DNA.

500 bp reads would be much nicer for non-model organism sequencing and joining contigs.

Jie Li

Assuming reads are generated at random, the probability of getting the same sequence is 1/4^500 for a 500 bp read and 1/4^100 for a 100 bp read. Even in real genomes, the difference in probability between the two cases is very large.
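To put those two probabilities on a comparable scale, it helps to work in log10, since 1/4^500 is too small to print usefully as a plain decimal. The snippet below is a small illustration that assumes the same uniform random model (each base equally likely); the function name is made up.

```python
import math


def log10_match_probability(read_length: int) -> float:
    """log10 of the chance that a random sequence of this length matches
    a given read exactly, assuming each base has probability 1/4."""
    return -read_length * math.log10(4)


for length in (100, 500):
    print(f"{length} bp: about 10^{log10_match_probability(length):.1f}")
```

Both values are so far below one in the number of positions in the genome that, under the random model, either read length would be unique. The practical difference between 100 bp and 500 bp therefore comes from repeats rather than from chance matches.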

Kt Pickard

Many thanks to everyone for their responses.

From the paper that Paul cited, 200bp reads appear to provide the best "bang for the alignment buck," with a long tail afterwards. With 1000bp reads you can uniquely identify 99.5% of the genome. If the extrapolation is valid, that number increases to 99.8% with 10k bp reads.

If read lengths had followed the same curve that sequencing costs have, the genome would be completely mapped by now. Until then, my takeaway is that bioinformaticians have job security for many years to come.

Not many bioinformaticians are concerned with longer contigs or finishing genomes. Those who are typically work in genome centres or on non-model organisms.

But, I agree, bioinformaticians will have good job security for a while yet. While some solid bioinformatics skills and also scripting language knowledge will soon become a standard bit of the toolkit for most wet lab genomics PhD students and postdocs, the bioinformatician will continue to serve as the specialist. A bit like a statistician, really. Plenty of people can do some stats, but you need a statistician for the challenging stuff.