I have annotated several genome sequences with their proteins. For each genome, the annotation is available either as a .faa protein FASTA file of the translated CDS sequences or as a .tsv file of the protein annotations (output by Prokka). I want to first categorise these proteins into distinct groups by role or function (e.g. membrane protein, metabolic, etc.) and then use these groups to show the relative number of proteins in each genome that have a given role. Each genome has well over 5000 annotated proteins, so this is not a manual job. Does anyone know of any packages that could help me do this sort of thing?
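
To make the goal concrete, here is a minimal sketch of the kind of tally I have in mind. It assumes the Prokka .tsv has the tab-separated columns `locus_tag`, `ftype`, `length_bp`, `gene`, `EC_number`, `COG` and `product`. The keyword rules, the group names, the function names and the sample rows are all hypothetical placeholders made up for illustration; a proper solution would presumably replace them with a curated scheme such as COG functional categories.

```python
import csv
import io
from collections import Counter

# Hypothetical keyword rules, for illustration only. The first matching
# group wins; anything unmatched falls into "other".
KEYWORD_GROUPS = [
    ("membrane/transport", ("membrane", "transporter", "permease", "porin")),
    ("metabolic", ("dehydrogenase", "synthase", "kinase", "reductase")),
    ("hypothetical", ("hypothetical protein",)),
]


def categorise(product):
    """Assign a product description to the first group whose keyword it contains."""
    text = product.lower()
    for group, keywords in KEYWORD_GROUPS:
        if any(keyword in text for keyword in keywords):
            return group
    return "other"


def group_fractions(tsv_handle):
    """Return the fraction of CDS features per group for one annotation table."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    counts = Counter(
        categorise(row["product"]) for row in reader if row["ftype"] == "CDS"
    )
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()} if total else {}


# Made-up rows in the assumed Prokka .tsv layout (not real data).
SAMPLE_TSV = (
    "locus_tag\tftype\tlength_bp\tgene\tEC_number\tCOG\tproduct\n"
    "X_00001\tCDS\t1404\tdnaA\t\tCOG0593\tChromosomal replication initiator protein DnaA\n"
    "X_00002\tCDS\t1053\toprF\t\t\tOuter membrane porin F\n"
    "X_00003\tCDS\t939\tmdh\t1.1.1.37\tCOG0039\tMalate dehydrogenase\n"
    "X_00004\tCDS\t300\t\t\t\thypothetical protein\n"
    "X_00005\ttRNA\t76\t\t\t\ttRNA-Ala(tgc)\n"
)

fractions = group_fractions(io.StringIO(SAMPLE_TSV))
for group, fraction in sorted(fractions.items()):
    print(f"{group}\t{fraction:.2f}")
```

For a real genome, the `io.StringIO(SAMPLE_TSV)` handle would be replaced by an open file handle on the .tsv, and the resulting fractions could be compared across genomes.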
