Dealing with next-gen SSR Microsatellite data from target enrichment experiment?

Vania Carolina Fonseca da Silva @Vania-Fonseca-Da-Silva

10 January 2020

Hi all, we have a bioinformatics challenge and would love any help this community can offer.

We have data from a target-enrichment experiment that was supposed to capture certain microsatellite motifs. The three enriched libraries were sequenced in a rapid run on an Illumina HiSeq 2500 (paired-end mode), and our data are in the standard Illumina FASTQ output. The three libraries come from three different sources: the first was prepared from fresh fish tissue, the second from mammal tissue, and the third from fecal samples of the same mammal species. For the fecal samples, we need to somehow filter the data so that only sequences belonging to the mammal remain (i.e. not prey or microbiome). We have a reference genome for the mammal, but not for the fish. The data have already been demultiplexed (so for the fish we have 40 individual fish, each with its own .fq file containing all of its read data).

We are now facing the challenge of how to deal with these data. Although we are familiar with most basic bioinformatic tools and analyses, we do not have advanced programming skills. We need a way not only to find our microsatellites within the reads and determine their lengths, but also (for the fecal library) to identify unique flanking sequences that correspond to our mammal, so that reads from other species in the fecal libraries can be excluded.

Would anyone have a suggestion on what approach(es) we could use? We have already (unsuccessfully) attempted to tackle this with SSR_pipeline.

Thank you in advance for any help you can offer - it is very much appreciated!

Daniel & Vania
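One part of this task, locating microsatellites in a read and measuring their length, can be sketched in a few lines of Python. This is only a minimal illustration and not what SSR_pipeline does: the function name `find_ssrs`, the motif sizes (2 to 6 bp) and the minimum of 5 repeats are assumptions chosen for the example, and only perfect (uninterrupted) repeats are detected.

```python
import re

def find_ssrs(read, min_unit=2, max_unit=6, min_repeats=5):
    """Return (start, motif, repeat_count, tract_length) for each perfect SSR.

    Assumed defaults: motifs of 2-6 bp repeated at least 5 times.
    Homopolymer runs can be reported as e.g. an "AA" motif; filter them
    afterwards if they are not of interest.
    """
    # A motif (shortest first) followed by at least (min_repeats - 1) more copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(read.upper()):
        motif = m.group(1)
        tract = m.group(0)
        hits.append((m.start(), motif, len(tract) // len(motif), len(tract)))
    return hits

# Example read: 5 bp of flank, (AC)x8, 5 bp of flank.
example = "TTGCG" + "AC" * 8 + "GGTCA"
print(find_ssrs(example))
```

The start position and tract length also give the coordinates of the flanking sequence on either side, which is what would be compared against the mammal reference.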

Abhinav Tyagi

Hi Vania,

If I understand your question correctly, your major issue is with your third library where you wish to segregate the reads of your focal mammal species from reads of other exogenous DNA present in the fecal DNA extract.

Since you already have a reference genome for your mammal species, I suggest you map your raw reads to it (I use BWA-MEM, but many other aligners are available). Once the alignment is done, the reads that map to the reference are the ones from your mammal species, and you can ignore all the reads that do not map properly.
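As a rough sketch of the step after alignment, the SAM output can be split into reads that map to the mammal and reads that do not. The function name `split_sam_reads` and the mapping-quality cutoff of 30 are assumptions for illustration, not part of the suggestion above; the code reads plain SAM text and uses only the standard FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary).

```python
def split_sam_reads(sam_lines, min_mapq=30):
    """Split read names into (host, other) from SAM-formatted text lines.

    Assumption: a read counts as 'host' only if it is mapped with
    MAPQ >= min_mapq. For paired-end data both mates share a name, so a
    pair is treated as host only if neither mate was rejected.
    """
    host, other = set(), set()
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, mapq = fields[0], int(fields[1]), int(fields[4])
        if flag & 0x900:                  # skip secondary/supplementary records
            continue
        if flag & 0x4 or mapq < min_mapq:  # unmapped or ambiguous placement
            other.add(name)
        else:
            host.add(name)
    host -= other                          # a rejected mate rejects the pair
    return host, other

# Usage with a SAM file written by the aligner (the path is hypothetical):
# with open("fecal_vs_mammal.sam") as handle:
#     host_reads, other_reads = split_sam_reads(handle)
```

Both sets are returned, so the same split serves whether the mammal reads or the non-mammal reads are the ones to keep.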

I hope this might be of some help.

Thanks

Stuart Stephen

Abhinav Tyagi: My assumption is that the faecal reads which do not map to the mammal are the reads that need to be retained. The experimental objective would then be to use these retained reads as some kind of marker to determine which fish species (or species group) was consumed as prey by the mammal. Vania Carolina Fonseca da Silva has not detailed how she intends to use these retained reads.

I have a follow-up comment proposing the approach I would take, and would be interested in comments on it.

Stuart Stephen

Vania Carolina Fonseca da Silva: You may want to consider a pangenomic approach in which you use K-mer count sets to calculate the probability that a faecal K-mer set originates from a given fish species or group of species.

Because you have sampled by enrichment, you only have some probability of sampling any given SSR in each species, and that probability is difficult to determine for each fish species when the genome is unknown.

My suggestion is to take each set of fish reads and do a K-mer analysis, maximising the K-mer size until you obtain a distribution such that (as a guesstimate) 10% of K-mers have coverage of between 10 and 100 copies.

The guesstimate of 10 is because you are expecting some coverage at the sampled SSRs, and fewer than 10 copies could be indicative of sequencing errors. My guesstimate of 100 is because you want to discard the repetitive SSR tracts themselves and are looking for K-mers likely to be unique both within and between fish species. Hopefully, for each fish species you will end up with many thousands of K-mers accepted within the copy-number constraints, with a maximal size of around 50% of your read length. You can’t use K-mer sizes close to the read length simply because most K-mers will be discoverable through partial read overlaps, not full-length overlaps.
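The K-mer size search described above could be sketched as follows. This is a toy, in-memory illustration under stated assumptions: the function names, the lower bound `min_k`, and the choice to start at half the read length and step down are mine; reverse complements are not merged; and a real data set would need a dedicated K-mer counting tool rather than a Python `Counter`.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every K-mer of size k across the reads, skipping K-mers with N."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts

def band_fraction(counts, low=10, high=100):
    """Fraction of distinct K-mers whose copy number lies in [low, high]."""
    if not counts:
        return 0.0
    in_band = sum(1 for c in counts.values() if low <= c <= high)
    return in_band / len(counts)

def choose_k(reads, read_length, target=0.10, low=10, high=100, min_k=15):
    """Return (k, accepted K-mers) for the largest k meeting the target.

    Starts at about 50% of the read length and decreases k until at least
    `target` of the distinct K-mers fall inside the copy-number band.
    Returns (None, set()) if no k down to min_k qualifies.
    """
    for k in range(read_length // 2, min_k - 1, -1):
        counts = count_kmers(reads, k)
        if band_fraction(counts, low, high) >= target:
            accepted = {kmer for kmer, c in counts.items() if low <= c <= high}
            return k, accepted
    return None, set()
```

Running `choose_k` once per fish (or per species) yields the accepted K-mer sets that feed the matrix described next.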

Repeat iteratively over all fish species, adjusting the K-mer size until some minimum number of K-mers per species is reached, and build a matrix with the columns: fish species, K-mer sequence, and count of that K-mer in the faecal sample reads (initialised to 0).

With your faecal sample reads, use a sliding window of the K-mer size with which you built the final accepted matrix, and slide this window along each read.

For each K-mer from the sliding window, check whether that K-mer is present in the mammal genome; if so, discard it and try the next window.

If the K-mer is not in the mammal genome, search the matrix for a matching K-mer sequence, and for every fish that matches, increment that fish's faecal K-mer match count.

When all K-mers in the faecal sample reads have been processed, you can process the counts in the matrix and assign probabilities to the mammal's consumption of individual fish species (or groups of fish).

I suggest that you discard counts below some threshold (say 10), as these could represent sequencing errors.
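The sliding-window matching and thresholding steps might look like this in Python. It is a sketch under assumptions: the mammal genome is represented by a precomputed set of K-mers (`mammal_kmers`), the matrix is a nested dictionary, and the final "share" is a simple normalised count that stands in for a probability without being a calibrated one. All names are hypothetical.

```python
from collections import defaultdict

def tally_prey_kmers(faecal_reads, k, fish_kmers, mammal_kmers):
    """Count faecal-read K-mers that match each fish's accepted K-mer set.

    fish_kmers:   dict mapping species name -> set of accepted K-mers
    mammal_kmers: set of K-mers present in the mammal genome (assumed
                  to have been computed beforehand)
    Returns a matrix as dict: species -> {K-mer: count}.
    """
    # Invert species -> K-mers so each window needs a single lookup.
    owners = defaultdict(list)
    for species, kmers in fish_kmers.items():
        for kmer in kmers:
            owners[kmer].append(species)
    # Every (species, K-mer) cell starts at 0.
    matrix = {sp: {kmer: 0 for kmer in kmers} for sp, kmers in fish_kmers.items()}
    for read in faecal_reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in mammal_kmers:       # host K-mer: discard, next window
                continue
            for sp in owners.get(kmer, ()):
                matrix[sp][kmer] += 1
    return matrix

def species_shares(matrix, min_count=10):
    """Normalised per-species totals, ignoring cells below min_count."""
    totals = {
        sp: sum(c for c in row.values() if c >= min_count)
        for sp, row in matrix.items()
    }
    grand = sum(totals.values())
    return {sp: (t / grand if grand else 0.0) for sp, t in totals.items()}
```

K-mers shared by several fish increment every matching species, which is why the shares are better read as relative evidence than as exact diet proportions.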

The foregoing approach will need much optimisation to become practical, but it could be applied even when the prey consumer's genome has yet to be assembled, provided you have reads from the consumer species. K-mers from the consumer reads are added to the matrix, and any K-mer whose count in the consumer reads exceeds some threshold is discarded from further analysis.
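For the genome-free variant, a sketch of the consumer filter could be as simple as the following. The function names and the default threshold of 10 are assumptions for illustration; the idea is only that K-mers seen often in the consumer's own reads are removed from every fish's K-mer set before the faecal reads are scanned.

```python
from collections import Counter

def consumer_kmer_blacklist(consumer_reads, k, threshold=10):
    """K-mers seen more than `threshold` times in reads from the consumer."""
    counts = Counter()
    for read in consumer_reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, c in counts.items() if c > threshold}

def drop_consumer_kmers(fish_kmers, blacklist):
    """Remove blacklisted K-mers from each species' accepted K-mer set."""
    return {sp: kmers - blacklist for sp, kmers in fish_kmers.items()}
```

The blacklist can also be passed wherever a set of mammal K-mers is expected, so the rest of the workflow stays the same.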

I am interested in your research and may be able to provide some bioinformatics expertise (https://github.com/kit4b). Do you have further details of your experiment that you can share?
