I am trying to generate a custom General Feature Format (GFF) file, but I need to calculate the phase of the coding sequence (CDS) for split ORFs, that is, ORFs that consist of multiple CDS segments. For instance, I have a GenBank (GB) file with the following entries:

[...]

    CDS join(104325..104524,104708..105281)
        /locus_tag="PF3D7_0301800"
        /old_locus_tag="PFC0090w"
        /codon_start=1
    [...]
    CDS join(119458..120319,120411..120986,121098..121151,
        121261..121323,121462..121572,121702..121887,
        121986..122144,122290..123056,123263..124735)
        /locus_tag="PF3D7_0302200"
        /old_locus_tag="PFC0110w"
        /codon_start=1

The CDS' phase is a concept that I haven't fully grasped. To calculate it, I have used the following bash code:

    span=$((end - start + 1))
    mod=$((span % 3))

Thus I calculate the remainder (span modulo 3) for each exon: 2 for the exon at 104325..104524, and 1 for the next exon at 104708..105281. The +1 is needed because both the start and end coordinates are inclusive. I validated the file with the GFF3 validator (http://genometools.org/cgi-bin/gff3validator.cgi), and the only error is on the second exon of PF3D7_0301800:
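To make the arithmetic above easy to check, here is a minimal sketch that prints the span and the remainder for every segment of a join(...) location. The function name segment_spans is hypothetical and only serves this illustration; it assumes forward-strand locations written as join(start..end,...).

```bash
#!/usr/bin/env bash
# segment_spans is a hypothetical helper, not part of any existing tool.
# It prints: start, end, span, span % 3 (tab-separated) for each segment.
segment_spans() {
    local location=$1
    # Strip the "join(" wrapper, the closing ")" and any spaces.
    location=${location#"join("}
    location=${location%")"}
    location=${location// /}
    local segment start end span
    local IFS=,    # split the location on commas (local to this function)
    for segment in $location; do
        start=${segment%%..*}
        end=${segment##*..}
        # Both coordinates are inclusive, hence the +1.
        span=$((end - start + 1))
        printf '%s\t%s\t%s\t%s\n' "$start" "$end" "$span" "$((span % 3))"
    done
}

segment_spans "join(104325..104524,104708..105281)"
```

For PF3D7_0301800 the spans are 200 and 574 bases, so the remainders are 2 and 1.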

    start   end     phase  should be
    104325  104524  2      2
    104708  105281  1      0

I therefore changed the code to:

    carry=0
    span=$((c5 - c4 + 1))
    mod=$(( (span % 3) + carry ))
    if [[ $mod -lt 3 ]]; then
        c8=$mod
    else
        c8=0
    fi

Now this is sorted out:

    start   end     phase  should be
    104325  104524  2      2
    104708  105281  0      0

But now I have an error in PF3D7_0302200:

    start   end     phase  should be
    119458  120319  1      1
    120411  120986  1      0
    121098  121151  1
    121261  121323  1
    121462  121572  1
    121702  121887  1
    121986  122144  1
    122290  123056  0
    123263  124735  0

If I change `if [[ $mod -lt 3 ]]` to `if [[ $mod -lt 2 ]]`, PF3D7_0302200 is fine but PF3D7_0301800 throws an error:

    start   end     phase  should be
    104325  104524  2      2
    104708  105281  0      1
    119458  120319  1      1
    120411  120986  1      1
    121098  121151  1      1
    121261  121323  1      1
    121462  121572  1      1
    121702  121887  1      1
    121986  122144  1      1
    122290  123056  0      0
    123263  124735  0      0

I tried different combinations, but essentially the problem is that if I get the phase right for PF3D7_0301800, it comes out wrong for PF3D7_0302200, and vice versa.

Would you know how to calculate the phase of a split CDS? Am I missing something?
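For comparison, here is a minimal sketch of the calculation as the GFF3 specification defines phase: the number of bases to skip from the start of a CDS segment to reach the first base of the next codon. It is a sketch under stated assumptions, not a confirmed fix. The function name cds_phases is hypothetical, the segments are assumed to be on the forward strand and passed in 5' to 3' order, and the first segment's phase is assumed to be /codon_start minus 1.

```bash
#!/usr/bin/env bash
# cds_phases is a hypothetical helper, not part of any existing tool.
# Usage: cds_phases <codon_start> <start..end> [<start..end> ...]
# Prints: start, end, phase (tab-separated) for each segment.
cds_phases() {
    local codon_start=$1
    shift
    # Assumption: the first segment skips (codon_start - 1) bases.
    local phase=$((codon_start - 1))
    local segment start end span
    for segment in "$@"; do
        start=${segment%%..*}
        end=${segment##*..}
        span=$((end - start + 1))    # coordinates are inclusive
        printf '%s\t%s\t%s\n' "$start" "$end" "$phase"
        # The bases left after the last whole codon of this segment decide
        # how many bases the next segment must skip to finish that codon.
        phase=$(( (3 - (span - phase) % 3) % 3 ))
    done
}

cds_phases 1 104325..104524 104708..105281
```

Unlike the snippets above, the phase of a segment here depends on the lengths of all preceding segments rather than on its own length. Under this definition PF3D7_0301800 gets phase 0 for its first segment and 1 for its second, which differs from the values in the tables above, so the output should be checked against the validator before relying on it.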

Thank you.
