I am trying to generate a custom General Feature Format (GFF) file, but I need to calculate the phase of the coding sequence (CDS) for split ORFs, that is, ORFs that have multiple CDS segments. For instance, I have a GenBank (GB) file with the following entries:
[...]
CDS join(104325..104524,104708..105281)
/locus_tag="PF3D7_0301800"
/old_locus_tag="PFC0090w"
/codon_start=1
[...]
CDS join(119458..120319,120411..120986,121098..121151,121261..121323,121462..121572,121702..121887,121986..122144,122290..123056,123263..124735)
/locus_tag="PF3D7_0302200"
/old_locus_tag="PFC0110w"
/codon_start=1
The CDS' phase is a concept that I haven't fully grasped. To calculate it, I have used the following bash code:
span=$(($end - $start +1))
mod=$((($span % 3)))
With this I calculate a modulus of 2 for the exon at 104325..104524, and the next exon at 104708..105281 should have a modulus of 0. The +1 is needed because both the start and end coordinates are inclusive. I validated the file with the GFF3 validator (http://genometools.org/cgi-bin/gff3validator.cgi), and the only error is on the second exon of PF3D7_0301800:
start end phase should be
104325 104524 2 2
104708 105281 1 0
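To make the arithmetic behind the "phase" column explicit, here is a minimal sketch of the span and remainder calculation for the two segments above. The function name `span_mod` is only a helper made up for this illustration.

```bash
# Sketch only: span_mod is a hypothetical helper, not part of any tool.
# Prints the inclusive length of start..end and its remainder modulo 3.
span_mod() {
    start=$1
    end=$2
    span=$(( end - start + 1 ))   # both coordinates are inclusive
    echo "$span $(( span % 3 ))"
}

span_mod 104325 104524
span_mod 104708 105281
```

The first segment is 200 bp long (remainder 2) and the second is 574 bp long (remainder 1), which is where the 2 and 1 in the "phase" column come from.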
I therefore changed the code to:
carry=0
span=$(($c5 - $c4 + 1))
mod=$((($span % 3) + $carry))
if [[ $mod -lt 3 ]]; then
    c8=$mod
else
    c8=0
fi
Now PF3D7_0301800 is sorted out:
start end phase should be
104325 104524 2 2
104708 105281 0 0
But now I have an error in PF3D7_0302200:
start end phase should be
119458 120319 1 1
120411 120986 1 0
121098 121151 1
121261 121323 1
121462 121572 1
121702 121887 1
121986 122144 1
122290 123056 0
123263 124735 0
If I change `if [[ $mod -lt 3 ]]` to `if [[ $mod -lt 2 ]]`, PF3D7_0302200 is OK but PF3D7_0301800 throws an error:
start end phase should be
104325 104524 2 2
104708 105281 0 1
119458 120319 1 1
120411 120986 1 1
121098 121151 1 1
121261 121323 1 1
121462 121572 1 1
121702 121887 1 1
121986 122144 1 1
122290 123056 0 0
123263 124735 0 0
I tried different combinations, but essentially the problem is that if I get the phase right for PF3D7_0301800, it comes out wrong for PF3D7_0302200, and vice versa.
Would you know how to calculate the phase of a split CDS? Am I missing something?
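For reference, the GFF3 specification describes the phase as the number of bases (0, 1 or 2) that must be removed from the beginning of a CDS feature to reach the first base of the next codon. Below is a minimal sketch of how I understand that rule. It assumes plus-strand features and `/codon_start=1`, so the first segment starts at phase 0. The function name `cds_phases` is a hypothetical helper, and I am not certain this is what the validator computes internally.

```bash
# Sketch only: cds_phases is a hypothetical helper name.
# Assumes plus strand and codon_start=1 (first segment has phase 0).
# Arguments are segments in GenBank "start..end" form, in order.
cds_phases() {
    phase=0
    for seg in "$@"; do
        start=${seg%%..*}             # text before the ".."
        end=${seg##*..}               # text after the ".."
        len=$(( end - start + 1 ))    # inclusive length
        printf '%s\t%s\t%s\n' "$start" "$end" "$phase"
        # Bases left over after skipping `phase` and reading whole codons
        # determine how many bases the next segment must skip.
        phase=$(( (3 - (len - phase) % 3) % 3 ))
    done
}

cds_phases 104325..104524 104708..105281
```

Each segment's phase depends on the phase and length of the segment before it, so the value has to be carried along the whole `join(...)` rather than computed from one segment in isolation.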
Thank you.