I am trying to analyse mutation data for endometrial cancer obtained from different studies in several databases (COSMIC, cBioPortal, IntOGen). I have collated the data and grouped the mutations by gene. The analysis focuses on non-synonymous coding mutations, because these are the mutations most likely to change normal protein function.

The aim of the study is to understand the mutational landscape of endometrial cancer. The main objectives are to find the commonly mutated genes in endometrial cancer, to find significantly damaging gene mutations, and to create an updated gene list comparable to commercial gene panels.

I have created a table from the collated data with the following columns:

  • Gene name
  • Number of samples with coding mutations
  • Frequency (number of samples with coding mutations / total number of samples with coding mutations)
  • CDS length
  • Total number of unique coding mutations
  • Number of unique coding: synonymous mutations
  • Number of unique coding: non-synonymous mutations
  • Mutation burden (number of unique coding: non-synonymous mutations / CDS length)
  • Composite score [(frequency of samples * 0.7) + (mutation burden * 0.3)]

The idea is to use mutation burden as a proxy for how damaging a gene's mutations are in endometrial cancer. We then created the composite score as a single figure for comparing genes.
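
To make the column definitions concrete, here is a minimal Python sketch of how frequency, mutation burden and the composite score could be computed per gene. The class and function names, the placeholder genes (`GENE_A`, `GENE_B`) and all the numbers are hypothetical and only for illustration; the 0.7 and 0.3 weights are the ones given above.

```python
from dataclasses import dataclass


@dataclass
class GeneRecord:
    gene: str
    samples_mutated: int  # samples with coding mutations in this gene
    cds_length: int       # CDS length in bases
    unique_nonsyn: int    # unique non-synonymous coding mutations


def frequency(rec: GeneRecord, total_samples: int) -> float:
    """Samples with coding mutations in the gene / total samples with coding mutations."""
    return rec.samples_mutated / total_samples


def mutation_burden(rec: GeneRecord) -> float:
    """Unique non-synonymous coding mutations / CDS length."""
    return rec.unique_nonsyn / rec.cds_length


def composite_score(rec: GeneRecord, total_samples: int,
                    w_freq: float = 0.7, w_burden: float = 0.3) -> float:
    """Weighted sum of frequency and mutation burden."""
    return w_freq * frequency(rec, total_samples) + w_burden * mutation_burden(rec)


# Hypothetical example values, not real data.
TOTAL_SAMPLES = 500
genes = [
    GeneRecord("GENE_A", samples_mutated=300, cds_length=1200, unique_nonsyn=240),
    GeneRecord("GENE_B", samples_mutated=50, cds_length=3000, unique_nonsyn=30),
]

# Rank genes from highest to lowest composite score.
ranked = sorted(genes, key=lambda g: composite_score(g, TOTAL_SAMPLES), reverse=True)
for g in ranked:
    print(g.gene, round(composite_score(g, TOTAL_SAMPLES), 4))
```

Note that frequency is a per-sample proportion while mutation burden is a per-base rate, so the two terms are on different scales and the weights alone do not determine how much each contributes to the score.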

At the moment, our list contains more than 16,000 genes. We are looking for a statistical way to narrow it down to the genes that are significantly mutated compared with the others. Any advice is greatly appreciated.
