How to generate a synthetic DNA barcode dataset?

André Marquardt

Hey,

may I ask what you need this synthetic dataset for, what it should look like, and what you did (methodologically) with your "real" dataset?

Regards,

André

Samir Mishrif Khalaf

Can you clarify why you need this dataset?

Rahul Arvind Jamdade

André Marquardt and Samir Mishrif Khalaf, thanks for your replies. I am using machine learning classifiers to classify species from a multi-sequence DNA dataset. At present we have an empirical (real) dataset that contains singleton and doubleton species, which limits the species resolution potential of the classifiers. If we compare classifier performance on this dataset with performance on a synthetic dataset (synthesized with >10 specimens per species), we can determine the precision and error rate of those classifiers.
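To make "precision and error rate" concrete, here is a minimal sketch of how the two could be computed from true and predicted species labels. The function names and the toy labels are hypothetical, and precision is taken per species as correct predictions divided by all predictions of that species; other definitions may suit a particular classifier better.

```python
from collections import Counter


def error_rate(true_labels, predicted_labels):
    """Fraction of specimens assigned to the wrong species."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


def precision_per_species(true_labels, predicted_labels):
    """For each predicted species: correct predictions / all predictions."""
    predicted = Counter(predicted_labels)
    correct = Counter(
        p for t, p in zip(true_labels, predicted_labels) if t == p
    )
    return {species: correct[species] / n for species, n in predicted.items()}


# Toy labels (hypothetical): four specimens from two species.
true_labels = ["A", "A", "B", "B"]
predicted_labels = ["A", "B", "B", "B"]

print(error_rate(true_labels, predicted_labels))
print(precision_per_species(true_labels, predicted_labels))
```

Running the same two functions on the empirical and on the synthetic dataset would give directly comparable numbers.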

So, if I understand you correctly, you just want a randomly generated dataset containing more than 10 species in a ratio that is not further specified? If that assumption is correct, I would recommend simply taking the reference sequences of your desired species, randomly generating fragments from them, and then randomly combining those fragments.

If I am wrong, I would be happy if you could clarify at which step I got lost.

Yes, André Marquardt! Starting from the reference sequence of each desired species, I want to randomly generate sequences while taking into account species divergence (no more than 2% divergent, otherwise a sequence would be considered a different species) and the effective population size (Ne), i.e., the number of individuals in a population (e.g., 10 individuals per species).
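One simple way to approximate this is sketched below: each simulated individual is a copy of the reference with random substitutions at no more than 2% of its sites. This is only an illustration under strong assumptions: substitutions are uniform, there are no indels, and no population-genetic model is used, so "Ne" is reduced to the number of individuals generated. All function names and the toy reference are hypothetical.

```python
import random

BASES = "ACGT"


def mutate(reference, max_divergence, rng):
    """Copy `reference`, substituting at most max_divergence of its sites."""
    max_changes = int(len(reference) * max_divergence)
    n_changes = rng.randint(0, max_changes)
    # Distinct positions, so the same site is never substituted twice.
    positions = rng.sample(range(len(reference)), n_changes)
    seq = list(reference)
    for pos in positions:
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)


def simulate_species(reference, n_individuals=10, max_divergence=0.02, seed=None):
    """Generate n_individuals variants, each within max_divergence of the reference."""
    rng = random.Random(seed)
    return [mutate(reference, max_divergence, rng) for _ in range(n_individuals)]


def p_distance(a, b):
    """Proportion of differing sites between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Toy 600 bp reference (hypothetical), 10 simulated individuals.
reference = "ACGT" * 150
individuals = simulate_species(reference, n_individuals=10, seed=1)
print(max(p_distance(reference, s) for s in individuals))
```

Note that the 2% limit here applies to the distance from the reference. Two simulated individuals can differ from each other by up to twice that, so if the threshold is meant pairwise, `max_divergence` would have to be set to 0.01.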

So, once again, just for clarification: you would like a dataset consisting of a selectable number of different samples, where the data for each sample may/should contain up to 2% spiked-in sequences that are NOT from this sample but from the same species as the "main" sample? Afterwards, you will run your trained classifier on each sample and want to obtain the species as the "label"?

If my summary is correct, you could do something like what I recommended:

1. Download all desired reference sequences.

2. Randomly "cut" these sequences to generate a large number of diverse fragments/reads, like those you would obtain by NGS after adapter trimming.

3. Use your fragment database, sorted by species, to generate a species-dependent but random sample with a selectable number of randomly spiked-in fragments from other reference sequences within the same species.

4. Analyze your data.
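The first two steps in the list above (collecting references and cutting them into random fragments) could be sketched roughly as follows. This assumes the reference sequences are already loaded as plain strings grouped by species, and it uses a fixed fragment length. The function names and the toy sequences are hypothetical and are not taken from the tool described in the linked publication.

```python
import random


def cut_fragments(reference, n_fragments, length, rng):
    """Cut n_fragments random substrings of a fixed length from reference."""
    if length > len(reference):
        raise ValueError("fragment length exceeds reference length")
    last_start = len(reference) - length
    starts = [rng.randint(0, last_start) for _ in range(n_fragments)]
    return [reference[s:s + length] for s in starts]


def build_fragment_db(references, n_fragments, length, seed=None):
    """Map each species to the fragments cut from all of its references.

    references: dict of species name -> list of reference sequences.
    n_fragments is the number of fragments cut per reference sequence.
    """
    rng = random.Random(seed)
    return {
        species: [
            frag
            for seq in seqs
            for frag in cut_fragments(seq, n_fragments, length, rng)
        ]
        for species, seqs in references.items()
    }


# Toy references (hypothetical); real barcodes would be several hundred bp.
references = {
    "species_1": ["ACGTACGTACGTACGTACGT"],
    "species_2": ["TTTTGGGGCCCCAAAATTTT", "GGGGCCCCAAAATTTTGGGG"],
}
db = build_fragment_db(references, n_fragments=5, length=8, seed=0)
print({species: len(frags) for species, frags in db.items()})
```

Keeping the fragments keyed by species makes the later sampling step a matter of drawing from the right list.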

Working in silico with random fragments is comparable to in vitro experiments. I have done this before: http://www.haematologica.org/content/104/2/277.long

To be honest, I am not quite sure how your classifier works on DNA sequences, but I would be happy to read the publication in the end.

Thanks André, could you please clarify point number 3?

How do I generate a species-dependent dataset that is randomly sampled, with a selectable number of randomly spiked-in fragments from other reference sequences within the same species?

I would appreciate it if you could suggest any software that comes with a tutorial.

No problem, you're welcome. I really do not know of any software for this, and I do not think one exists for your purpose. I would recommend coding it yourself; that is at least how I always do it.

Regarding point 3:

After you have downloaded the reference sequences of all the species and "samples" you want, you can easily cut out fragments of a specific length from them. In my tool I also linearized the genome so that I no longer have to take chromosomes into account. After cutting out, for example, 100,000,000 reads per species/sample, you can randomly select fragments from this database. If you want, you can additionally spike in some fragments from other species or samples (other reference genomes), which gives you an in silico sample consisting of your randomly generated fragments.
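As a rough illustration of the sampling step, the sketch below draws reads from a "main" fragment pool and spikes in a selectable fraction from a second pool. The function name, the default 2% fraction and the toy pools are assumptions made for the example, and fragments are drawn with replacement, which may or may not match how the published tool works.

```python
import random


def make_sample(main_pool, spike_pool, n_reads, spike_fraction=0.02, seed=None):
    """Build one in silico sample as a shuffled list of (read, origin) pairs.

    main_pool: fragments of the sample's own reference.
    spike_pool: fragments of other references (same or other species).
    spike_fraction: share of reads drawn from spike_pool.
    """
    rng = random.Random(seed)
    n_spike = round(n_reads * spike_fraction)
    n_main = n_reads - n_spike
    # Sampling with replacement: the same fragment may be drawn more than once.
    reads = [(rng.choice(main_pool), "main") for _ in range(n_main)]
    reads += [(rng.choice(spike_pool), "spike") for _ in range(n_spike)]
    rng.shuffle(reads)
    return reads


# Toy fragment pools (hypothetical).
main_pool = ["ACGTACGT", "CGTACGTA", "GTACGTAC"]
spike_pool = ["TTTTGGGG", "GGGGCCCC"]
sample = make_sample(main_pool, spike_pool, n_reads=1000, seed=3)
print(sum(1 for _, origin in sample if origin == "spike"))
```

The origin tag is kept only so the composition of the sample can be checked; it would be dropped before the reads are handed to a classifier.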

The whole process, as I did it previously with human DNA as the sample, is described in the publication I linked.

Hope this clarifies it a bit, feel free to ask if not.