I am building a parallel computing tool capable of searching for any n-mer sequences from genome A along genome B. As a validation, I found these 276,467 sequences from a retrovirus vs. human genome Chr1.
You could look at the COVERAGE of the human reference genome by n-mers from the virus and vice versa. In doing so you could allow small holes or other imperfections in otherwise perfectly covered/matching regions to vary the sensitivity of the mapping.
Visualizing this data on the human reference genome, you could see which parts of the genome are 0%, 10%, 20%, ..., 90% or 100% covered by some virus sequence.
On the virus sequence, you could see whether different parts of the virus are equally repetitive in the reference or, taking the imperfections into account, which regions of the virus reach the same coverage only when 0%, 10%, ..., 80% or 90% mutations are allowed.
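The coverage idea above could be sketched roughly as follows. This is only a minimal Python illustration on toy strings, not the tool from this thread: the function names, the `max_gap` parameter (how large a hole is still bridged) and the fixed `window` size are all assumptions made for the example.

```python
def kmer_hit_starts(query, target, k):
    """Start positions in `target` of every k-mer that also occurs in `query`."""
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    return [j for j in range(len(target) - k + 1) if target[j:j + k] in query_kmers]


def covered_intervals(starts, k, max_gap=0):
    """Merge k-mer hits into half-open intervals.

    Holes of up to `max_gap` bases between hits are bridged and counted
    as covered (hypothetical knob for the sensitivity of the mapping).
    """
    intervals = []
    for s in sorted(starts):
        if intervals and s <= intervals[-1][1] + max_gap:
            intervals[-1][1] = max(intervals[-1][1], s + k)
        else:
            intervals.append([s, s + k])
    return [tuple(iv) for iv in intervals]


def window_coverage(intervals, length, window):
    """Covered fraction of each consecutive window of `window` bases."""
    covered = [False] * length
    for a, b in intervals:
        for p in range(a, min(b, length)):
            covered[p] = True
    fractions = []
    for w in range(0, length, window):
        chunk = covered[w:w + window]
        fractions.append(sum(chunk) / len(chunk))
    return fractions


# Toy data only; real genomes would need a far more memory-efficient layout.
virus = "ACGTACGT"
genome = "TTACGTAGGACGTTT"
starts = kmer_hit_starts(virus, genome, k=4)
strict = covered_intervals(starts, k=4, max_gap=0)
relaxed = covered_intervals(starts, k=4, max_gap=2)
print(window_coverage(strict, len(genome), window=5))
print(window_coverage(relaxed, len(genome), window=5))
```

Swapping the roles of `virus` and `genome` gives the view from the virus side. Raising `max_gap` is one simple way to trade strictness for sensitivity, although it only handles holes between exact hits, not mismatches inside an n-mer.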
Matej, that sounds good as a next step. I am thinking of ways to do it with an adaptation of this algorithm. Currently it searches for all possible n-mer sequences, validating each one with the String.contains() method.
It is hard and slow for an Intel Core i7, but at least it runs as parallel processes.
By the way, it keeps finding new sequences; I now have a set of 194,538 sequences, from 100-mers to 200-mers.
However, this retrovirus is known to be endogenous in the human genome, so this run only serves as a validation of the tool. I will follow your recommendation regarding coverage, but first I wonder which virus-vs-genome pair would be appropriate for exploration.