I am trying to identify the insertion point in the genome that may have caused the difference in fruit size between the wild type and a mutant of a plant species (also described here : https://www.researchgate.net/post/Identifying_insertion_of_transposable_element_in_a_large_plant_genome_any_recommendations_on_my_situation).
Briefly, the procedure I have used to narrow down the mutated region is: short-read whole-genome sequencing -> k-mer frequency analysis -> contig assembly -> read mapping, alignment of contigs against databases, etc. In short, my intermediate goal is to extract only the reads containing the mutated region, assemble them, and assess the assembly by read mapping, before moving on to wet-lab validation (e.g. PCR).
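To be concrete, here is a toy sketch of the kind of k-mer filtering I mean: collect the k-mers seen in wild-type reads, then keep only mutant reads carrying enough k-mers absent from the wild type. The file names, k, and the cutoff are placeholders, not my real settings, and for real WGS data a dedicated k-mer counter such as Jellyfish or KMC would be needed; this is only to illustrate the logic.

```python
# Toy sketch: keep mutant reads that carry k-mers never observed in wild-type reads.
# File names, K and MIN_NOVEL are placeholders for illustration only.

K = 31           # k-mer size (placeholder)
MIN_NOVEL = 5    # min. number of wild-type-absent k-mers required to keep a mutant read (placeholder)

COMP = str.maketrans("ACGT", "TGCA")

def canon(km):
    """Canonical k-mer: lexicographically smaller of the k-mer and its reverse complement."""
    rc = km.translate(COMP)[::-1]
    return min(km, rc)

def read_fastq(path):
    """Yield (header, sequence) pairs from an uncompressed FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()
            fh.readline()  # '+' separator line
            fh.readline()  # quality line
            yield header, seq

def kmers(seq, k=K):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

# 1) k-mers observed in the wild-type reads (presence/absence is enough here)
wt_kmers = set()
for _, seq in read_fastq("wildtype_reads.fastq"):
    wt_kmers.update(canon(km) for km in kmers(seq))

# 2) keep mutant reads with enough k-mers that are absent from the wild type
kept = []
for header, seq in read_fastq("mutant_reads.fastq"):
    novel = sum(1 for km in kmers(seq) if canon(km) not in wt_kmers)
    if novel >= MIN_NOVEL:
        kept.append((header, seq))

print(f"kept {len(kept)} candidate mutant reads")
```

Note that sequencing errors also generate k-mers absent from the wild type, which is exactly why the MIN_NOVEL cutoff exists and why I find it hard to set without a control.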
Honestly, I am frustrated with the long chain of procedures I have applied to the sequencing data without any control to tell me whether I have overdone something (e.g. set the selection criteria too stringently). I am also overwhelmed by the number of available algorithms (e.g. assemblers), whose outputs differ substantially and are, again, hard to compare without a control or gold standard. It is painful to set filters based on theoretical calculations (to which there are always exceptions) or sometimes even intuition (e.g. how much slack to leave so as not to miss an exception). On the other hand, if I screen very conservatively, I am left with tons of candidate sequences.
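By "control" I mean something like the toy spike-in below: splice a known sequence into a wild-type sequence at a known position, simulate reads from the modified sequence, and then check at each filtering step whether the junction-spanning reads survive. Everything in the snippet (sequences, read length, coverage, error rate) is made up for illustration; a real read simulator (e.g. wgsim or ART) would do this properly.

```python
# Toy positive control: insert a known sequence at a known position, sample reads,
# and count the junction-spanning reads that the filtering steps must not lose.
# All parameters and sequences here are made-up placeholders.
import random

random.seed(1)

def simulate_reads(seq, read_len=100, coverage=30, error_rate=0.01):
    """Sample uniform single-end reads with simple substitution errors."""
    n_reads = int(len(seq) * coverage / read_len)
    bases = "ACGT"
    for _ in range(n_reads):
        start = random.randrange(0, len(seq) - read_len + 1)
        read = list(seq[start:start + read_len])
        for i in range(read_len):
            if random.random() < error_rate:
                read[i] = random.choice(bases)
        yield start, "".join(read)

wildtype = "".join(random.choice("ACGT") for _ in range(10_000))  # stand-in for a real region
insert   = "".join(random.choice("ACGT") for _ in range(500))     # stand-in for the TE/insertion
pos      = 4_000                                                  # known insertion point

mutant = wildtype[:pos] + insert + wildtype[pos:]

# Reads overlapping either junction are the ones the pipeline must retain.
junctions = (pos, pos + len(insert))
spanning = 0
for start, read in simulate_reads(mutant):
    end = start + len(read)
    if any(start < j < end for j in junctions):
        spanning += 1
print(f"{spanning} junction-spanning reads to track through the filters")
```

If something like this is a sensible way to calibrate thresholds, I would appreciate confirmation; if not, I would like to know what a proper control looks like for this kind of analysis.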
Given this situation, I have the following questions:
1) What are the principles for building controls into bioinformatics procedures?
2) What are the principles for balancing screening efficiency against the risk of losing the real candidate?
3) Any other suggestions for effectively screening for the true mutation (including wet-lab approaches)?
I am wondering whether I should simply accept the situation given the quality and quantity of data on hand; in any case, I would like to see how far I can get by optimizing the analysis after getting advice on these questions.