banner



Silent Missense And Nonsense Mutations

  • Research commodity
  • Open up Access
  • Published:

Gene characteristics predicting missense, nonsense and frameshift mutations in tumor samples

  • 2891 Accesses

  • 7 Citations

  • 1 Altmetric

  • Metrics details

Abstract

Groundwork

Because driver mutations provide selective advantage to the mutant clone, they tend to occur at a higher frequency in tumor samples compared to selectively neutral (passenger) mutations. All the same, mutation frequency lone is insufficient to identify cancer genes because mutability is influenced past many factor characteristics, such every bit size, nucleotide composition, etc. The goal of this report was to place gene characteristics associated with the frequency of somatic mutations in the gene in tumor samples.

Results

We used information on somatic mutations detected past genome wide screens from the Catalog of Somatic Mutations in Cancer (Catholic). Gene size, nucleotide composition, expression level of the cistron, relative replication time in the cell cycle, level of evolutionary conservation and other cistron characteristics (totaling eleven) were used as predictors of the number of somatic mutations. We applied stepwise multiple linear regression to predict the number of mutations per factor. Because missense, nonsense, and frameshift mutations are associated with dissimilar sets of gene characteristics, they were modeled separately. Gene characteristics explain 88% of the variation in the number of missense, twoscore% of nonsense, and 23% of frameshift mutations. Comparisons of the observed and expected numbers of mutations identified genes with a college than expected number of mutations– positive outliers. Many of these are known driver genes. A number of novel candidate driver genes was also identified.

Conclusions

Past comparing the observed and predicted number of mutations in a gene, nosotros take identified known cancer-associated genes as well as 111 novel cancer associated genes. We likewise showed that adding the number of silent mutations per gene reported by genome/exome broad screens across all cancer type (Catholic data) as a predictor substantially exceeds predicting accuracy of the most popular cancer gene predicting tool - MutsigCV.

Background

Predictive differentiation between functional and neutral somatic and germline mutations was and continues to be a hot topic of bioinformatics research. A number of tools using a number of predictors including, level of evolutionary conservation, effect on protein structure, functional DNA sequences, e.g. transcription factor binding sites and other have been developed [1,ii,3,four,v,6,7]. However, more specific topic, namely development of tools for identification of cancer-associated genes gets less attention.

In many cases cancer development is driven by somatic mutations. [8] Mutations providing a proliferative or survival reward to the mutant clone (drivers) occur more frequently in tumor samples compared to selectively neutral (rider) mutations. [9, 10] Known cancer-associated genes are among the most frequently mutated genes. In general, the number of somatic mutations per factor indicates the cistron's interest in cancer development. However, a simple counting of somatic mutations can exist misleading considering the number of mutations per gene depends non only on the interest of the gene in tumorigenesis but also on the gene'southward intrinsic mutability that in turn depends on factor characteristics.

A number of gene characteristics have been shown to be associated with mutability. It has been shown that genes with a higher expression level tend to have a higher frequency of somatic mutations. [eleven, 12] Some other known factor characteristic associated with mutability is relative replication time within cell bicycle: later replicating genes tend to have a college number of somatic mutations. [11, 12] Chromatin accessibility has been shown to exist positively associated with the density of somatic mutations. [xiii] Differences in mutation rate of different nucleotide substitutions, e.thousand. loftier frequency of transitions in CpG sites [14] suggest that nucleotide limerick of the gene also may be associated with mutability. Those and other cistron characteristics are inter-correlated. Gene length has been shown to be correlated with selective codon usage (nucleotide composition) [15] Replication timing is correlated with gene expression level [16] We found that size of the factor positively correlates with the level of evolutionary conservation. [17] Inter-correlations between predictors call for a multivariate regression model to predict the number of somatic mutations in the gene. According to our initial analyses, missense, nonsense and frameshift may take unlike sets of predictors (gene characteristics) and therefore need to be modeled separately. A recent study by Martincorena et al. [18] used normalized ratio of not-synonymous to synonymous mutations to identify genes under positive or negative selection in cancer evolution. The authors noted that near one-half of the identified driver mutations "occur in yet-to-be-discovered cancer genes".

Our analysis is based on the hypothesis that inter-cistron variation in the number of somatic mutations has two sources: (1) the variation due to differences in gene characteristics, and (two) the variation due to the involvement of the gene in cancer development. We tried to explain the intergenic variation in the number of somatic mutations past the variation in gene characteristics. Outliers – genes for which the number of somatic mutations cannot be explained by factor characteristics are candidate cancer genes.

Methods

Design of the study

The goal of this study is to build statistical model for prediction of the expected number of somatic mutations in a given gene based on the gene characteristics. To build the model we used somatic mutation data generated past whole exome sequencing of tumor samples. Nosotros separately predicted missense, nonsense, and frameshift mutations. Residuals from the models were analyzed to detect outliers – genes with a college-than-expected number of mutations. The excess of mutations unexplained by gene characteristics is due to the factor involvement in cancer development and can be used to identify cancer-associated genes.

Mutation data

We used mutation data from the Catalog of Somatic Mutations in Cancer (Catholic) (accessed August 17, 2017). To ensure compatible testing across all genes, only mutations detected past whole genome screens were used. All cancer types were included in the assay. A full of 19,147 tumor samples were analyzed. Mutations reported as SNPs were excluded from the assay. In full in that location were ii,233,115 missense, 163,823 nonsense, and 85,272 frameshift (FS) mutations, including those resulted from nucleotide insertions as well as nucleotide deletions.

Gene characteristics

The post-obit gene characteristics were used as predictors:

  1. one.)

    Factor size. We used data from the NCBI Consensus coding sequence project to estimate gene coding region sizes. [19] When multiple transcripts were reported for the aforementioned gene, the largest transcript was used. A moving boilerplate was used to illustrate the relationship between the gene size and the number of somatic mutations in it. In brief, genes were ranked based on the size from shortest to longest. The sliding window of 100 nucleotides was moved along the genes with one nucleotide step. We found that this size of the sliding window is optimal for smoothing of the relationship while keeping the furnishings of strong outliers like TP53 visible. The average size and average number of mutations were computed for each position of the window. Scatterplots were used to visualize the relationship betwixt the cistron size and the number of mutations. The moving average approach was used to visualize the relationships between the number of mutations in the factor and other predictors.

  2. 2.)

    Number of potential sites for a given type of mutations. The type of mutation produced by a single nucleotide exchange (SNS) depends on type of SNS (e.g. C > T) and its position in a given codon. There are three possible SNSs per each nucleotide position which makes the full number of all possible SNSs in the factor equal to 3xN, where N is the length of the coding region in nucleotides. We predicted outcomes of all possible SNSs in each factor to estimate the number of SNSs producing missense, nonsense or silent mutations in the factor – the number of potential sites in a gene for a given type of somatic mutations.

  3. 3.)

    Nucleotide limerick. For each cistron we estimated the proportions of each of the four nucleotides in the coding region of the gene. The human relationship between the percentage of each nucleotide and mutation densities were analyzed. Mutation densities were computed as the ratios of the full number of mutations to the size of the coding region of the gene in nucleotides. We used the density rather than the number of mutations per gene to business relationship for the effect of the gene size.

  4. 4.)

    Percentage of CpGs. Mutation charge per unit is known to exist higher in CpG dinucleotides [14] suggesting that genes with a college proportion of CpG may have a higher mutation charge per unit and as a result a higher number of somatic mutations. We used per centum of CpGs every bit a predictor of mutation density.

  5. 5.)

    Evolutionary conservation. Some studies betoken that evolutionary conservation of the factor correlates with mutability. [20] Equally a measure of evolutionary conservation of the factor we used conservation index. [21] Orthologs for each gene were identified amongst xx species with consummate genome sequences: Pan troglodytes, Macaca mulatta, Canis lupus familiaris, Bos taurus, Mus muscle, Rattus norvegicus, Gallus gallus, Xenopus tropicalis, Danio rerio, Drosophila melanogaster, Anopheles gambiae, Caenorhabditis elegans, Saccharomyces cerevisiae, Kluyveromyces lactis, Eremothecium gossypii, Schizosaccharomyces pombe, Magnaporthe oryzae, Neurospora crassa, Arabidopsis thaliana, and Oryza sativa. Conservation alphabetize of 1 was assigned to the genes with 0 or 1 orthologs, conservation index two was assigned to the genes with two or iii orthologs so on.

  6. 6.)

    Cistron expression level. It has been shown that the expression level of the gene negatively correlates with the density of somatic mutations. [11, 12] Gene expression data for 1037 cancer cell lines were downloaded from the Cancer Cell Line Encyclopedia (CCLE). [22] For each cistron we computed boilerplate expression across CCLE prison cell lines and used it every bit a predictor of the mutation density.

  7. 7.)

    Nucleotide diversity. We noted bong-shaped curves describing the relationship between the pct of nucleotides and the density of missense mutations suggesting that genes with like percentages of all nucleotides (25% each) may tend to have a higher density of somatic mutations. To account for this result we devised a single measure characterizing how strongly the proportions of four nucleotides deviate from being equal. Nosotros called this measure nucleotide diversity (ND). ND was defined every bit the probability that ii nucleotides randomly selected from the gene coding sequence are different: ND = one-(P(A) 2 + P(C) 2 + P(One thousand) 2 + P(T) ii), where P(A), P(C), P(G), and P(T) are the percentages of each nucleotide in the gene. ND was computed for each cistron and used as a predictor.

  8. 8.)

    SNP density. Genes with a high propensity to mutate are also expected to have a college density of germline polymorphisms. We used SNPs to guess the density of germline polymorphisms in a gene. SNP density was computed as a ratio of the total number of unique SNPs in the coding region to its size in nucleotides. SNPs detected by the 1000 genomes projection [23] were used in this analysis to ensure that different genes were targeted the same number of times.

  9. 9.)

    Density of the silent mutations. Fifty-fifty though some silent mutations are known to be functional [24], most of them are neutral and therefore the density of silent mutations in the gene tin be used as a quantitative measure of mutability of the gene. We computed the density of silent mutations for each gene and used it as a predictor.

  10. x.)

    Relative replication time. Late-replicating genes tend to have a college number of mutations. [11, 12] Nosotros used the relative replication time data from Ryba et al. (2012). [25] Human genome build GRCh38 was used to match the positions of probes with positions of the genes. When several probes were mapped to the same gene, average replication time for all probes in the cistron was used as a predictor. The closest probe was used when at that place were no probes in the cistron. The relative replication time (negative for early and positive for late-replicating genes) was used as a predictor.

  11. 11.)

    Chromatin accessibility. Chromatin accessibility has been shown to be associated with mutability of the region. [xiii] Data from the report by Sos et al. [26] were used in chromatin accessibility assay. The written report used transposon hypersensitive sites sequencing assay to assess chromatin accessibility. The mean chromatin accessibility beyond 10 lymphoblastic cell lines was computed for each factor and used every bit a predictor for density of missense, nonsense and FS mutations separately.

  12. 12.)

    Covariates from MutsigCV. We besides included 3 predictors (co-variates) used by MutsigCV: "expr", "hic" and "reptime" [12]. "Expr" is the expression level of this gene, averaged across 91 prison cell lines in the Cancer Prison cell Line Encyclopedia. "Reptime" is replication time of this gene (measured in HeLa cells), ranging from 100 (very early) to 1000 (very tardily). "Hic" chromatin state of this gene (measured from HiC experiments in K562 cells) ranging from − fifty (very closed) to + 50 (very open). We used similar predictors factor expression, relative replication fourth dimension and chromatin accessibility. The difference of our predictors from those used by MutsigCV was sources of the data: we used dissimilar studies to judge the same gene characteristics. Past using different sources nosotros tin assess the reliability of the predictors and their sensitivity to the source of the information.

Statistical analysis

As a commencement step for statistical analysis nosotros examined descriptive statistics for predictors and consequence and estimated pairwise correlations between predictors across fifteen,610 genes. Nosotros used non-parametric Spearman'south rank club correlation. We used a stepwise multiple linear regression model implemented in STATISTICA (StatSoft) to identify a best subset of predictors of the number of mutations per gene. Residual analysis was used to discover outliers – genes with a higher than expected number of missense, nonsense, or FS mutations. For each gene, residual Z-scores were computed separately for missense, nonsense and FS mutations. Residuals from the prediction models follow standard normal distribution N(0,i). Z-score is the signed value of standard deviations from mean which is zero for standard normal distribution. Positive Z-score indicates an excess and negative - a deficit of mutations in the gene compared to the expected numbers. The absolute value of Bonferroni corrected Z values based on xv,610 tests (the total number of genes used in the analysis) was further corrected equally being a maximum of three Z-scores. Only genes with complete information for all predictors were used in this analysis. Nether the assumption of independence of the iii scores, the threshold used for significance was: \( {\Phi}^{-1}\left(\sqrt[3]{1-\blastoff /north}\right) \), where Φ−1(p) denotes the quantile office of the normal distribution, α = 0.05 andn = 15,610, which yielded a cutoff value of 4.74.

Results

As expected, stiff positive associations between the gene size and the number of mutations were detected for all types of mutations (Fig. 1). Like relationships were detected with the number of potential sites (Additional file 1).

Fig. 1
figure 1

The relationship between the number of missense, nonsense, and frameshift mutations and cistron size

Full size prototype

Figure 2 shows the relationship between the nucleotide composition and the density of missense (first cavalcade), nonsense (second column) and frameshift (tertiary column) mutations. For nonsense mutations, there was a linear relationship between the per centum of each nucleotide and the mutation density, equally expected from the nucleotide composition of stop codons (TAA, TAG, and TGA). Peaks on the curves are driven by CDKN2A and TP53. These genes accept a much larger number of nonsense mutations compared to the genes with a similar nucleotide composition. For missense mutations, the peaks are driven by TP53 and KRAS. A curvilinear shape describes the relationships between the percentages of "A" and "C" nucleotide percentage and density of missense mutations. The peak coincides with nucleotide densities close to 0.25.

Fig. 2
figure 2

The relationship between the nucleotide composition and the density of missense (first cavalcade), nonsense (2d column), and FS (third cavalcade) mutations

Full size image

We observed an up-going tail on the left side of the curve describing the relationship betwixt the percentage of "1000" and the density of missense mutations. A similar up-going tail was observed on the correct side of the bend describing the relationship betwixt the percentage of "T" and the density of missense mutations. Both tails are driven by olfactory receptor genes (total 368). We found that the density of missense mutations in olfactory receptors is twice that of other genes in the human genome: 107.v ± 2.9 versus 49.four ± 0.4 mutations per 1 kb. Densities of nonsense and FS mutations in olfactory genes are not elevated. Olfactory genes also have an unusually depression percentage of "Thou" and a high percent of "T". The percentages of "A", "C", "M" and "T" in olfactory genes are correspondingly 22.1 ± 0.3, 26.vi ± 0.iii, 20.2 ± 0.3, and 31,1 ± 0.iii, while the corresponding percentages in all other genes are 24.iii ± 0.i, 26.three ± 0.one, 27.8 ± 0.1, and 21.vi ± 0.ane. The combination of an "abnormal" nucleotide composition and a higher density of missense mutations result in upward-going tails for missense mutations: left for the percentage of "1000" and right for the percent of "T". When olfactory genes were removed from the analyses, the upward-going tails disappeared (Additional file two).

For frameshift mutations, nosotros detected a positive linear relationship between the percentage of "A" and the density of mutations and a negative relationship with the percentage of "G". Densities of missense and nonsense mutations were negatively associated with both the percentage of CpGs and the level of evolutionary conservation (Boosted files 3 and 4, respectively).

We observed a negative association between the average expression level in CCLE cancer cell lines and the mutation densities (Fig. 3a). Because the curves were L-shaped, we log-transformed factor expression values. The transformation improved the Rtwo derived from linear regression from 0.59 to 0.69 for missense, and from 0.xviii to 0.27 for nonsense mutations. Correlation betwixt gene expression and the density of frameshift mutations was not pregnant. We also noted a strong positive association between the density of silent mutations in the gene with the densities of other mutation types (Fig. 3b). Figure 3c shows the relationship between the mutation densities of missense, nonsense and FS mutations and the relative replication time. Consistent with published studies [xi, 12] we observed a potent positive association between the replication fourth dimension and the mutation density for missense and nonsense mutations but not for frameshift mutations.

Fig. 3
figure 3

(a) The relationship between average expression in CCLE cancer prison cell lines and the mutation densities. (b) The relationship betwixt the density of silent mutations and the densities of missense, nonsense and frameshift mutations. (c) The human relationship between the relative replication fourth dimension and the densities of missense, nonsense, and frameshift mutations

Full size prototype

A positive clan betwixt the nucleotide diversity (ND) and the densities of missense and nonsense mutations was noted (Additional file v). A significant negative association between chromatin accessibility and the density of missense and nonsense mutations in the gene has been observed (Additional file 6).

Correlations between predictors

We found that gene characteristics used in this analysis are highly correlated (Table i). Out of 120 possible pair wise correlations, 112 pairs were statistically significant. Bated from expected correlations, eastward.yard. correlation between the number of potential sites for mutations and gene size, we observed a number of unexpected correlations. For instance, we noted that larger genes tended to accept a college pct of "A" nucleotides. Larger genes also tended to accept higher evolutionary conservation indices. Genes with a higher expression level tended to replicate earlier. Considering of widespread correlations among predictors we used stepwise all-time subset multivariate regression.

Tabular array i Pair-wise correlations between gene characteristics

Full size tabular array

Univariate analyses

Below we present the results of univariate regression with the number of mutations in the cistron as the event and gene characteristics as predictors.

Missense mutations

In the univariate analysis, the about meaning predictor of the number of missense mutations was the number of silent mutations in the cistron (Table 2). Factor size and the number of potential missense mutation sites were the next most significant predictors with like levels of significance. Relative replication time from MutsigCV ("reptime") and our analogous predictor (relative replication time) testify similar levels of significance. Our predictor "Gene expression in CCLE cancer cell lines" was more than significant compared to the analogous predictor from MutsigCV – "expr". For chromatin accessibility, MutsigCV predictor "hic" was more than significant compared to our predictor "Chromatin accessibility".

Table 2 Gene characteristics associated with the number of missense mutations per gene in univariate regression models

Total size table

Nonsense mutations

Tabular array three shows results of univariate assay for nonsense mutations. The number of potential sites for nonsense mutations was the most significant predictor, followed past the factor size and number of silent mutations. Compared to missense mutations nucleotide composition seem to be more important for the prediction of nonsense mutations. This is probable due to the fact that a subset of codons capable to produce nonsense mutations tends to be A-rich and G-poor.

Table 3 Cistron characteristics associated with the number of nonsense mutations in the univariate linear regression model

Full size table

Frameshift mutations

Table iv shows the results of univariate analyses for FS mutations. The gene size was the most meaning predictor followed past the number of silent mutations. The nucleotide limerick was as well significant with C + Thousand rich genes having lower number of FS mutations. The level of evolutionary conservation was positively associated with the number of FS mutations in the gene.

Table four Gene characteristics associated with the number of FS mutations per factor in univariate linear regression model

Total size table

Prediction of the number of missense, nonsense and frameshift mutations together

Table 5 shows predictors for missense, nonsense and frameshift mutations analyzed together. The results of this analysis are similar to the results of the analysis of missense mutations.

Table 5 Gene characteristics associated with the number of missense, nonsense and frameshift mutations analyzed together in univariate linear regression model

Full size table

Predictors for multivariate analysis

We selected predictors for multivariable analysis based on their significance in univariate analyses and the linearity of the association with the result. Table 6 shows the cistron characteristics selected for each blazon of mutations. In all multivariate analyses we likewise included 3 covariates from MutsigCV (not shown in Table six). Olfactory genes were excluded because of their distinctive nucleotide limerick and high density of missense mutations. TP53, CDKA2, and KRAS were also excluded from the analyses considering they were obvious outliers in univariate analyses.

Table 6 Gene characteristics selected for the model building for the missense, nonsense, and frameshift mutations

Full size tabular array

Multivariate analysis

Prediction of missense mutations

Table 7 shows missense mutations predictors that remained significant in the stepwise best subset linear regression. The most significant predictor was the number of silent mutations in the gene. Nucleotide diversity and the percentages of "C" and "Chiliad" nucleotides were also meaning. The R2 for the whole model was 0.88. Additional file 7 shows the relationship betwixt the predicted and the observed numbers of missense mutations.

Tabular array 7 Gene characteristics meaning in stepwise best subset multiple linear regression model for the prediction of the number of missense mutations

Full size table

Prediction of nonsense mutations

Table 8 shows cistron characteristics that remained pregnant in the multiple linear regression model for nonsense mutations. The most significant predictor was the number of potential sites for nonsense mutations. The other significant predictors included number of the detected silent mutations and the gene size. The model R2 was 0.40. Additional file 8 shows the relationship between the predicted and the observed numbers of nonsense mutations.

Table 8 Factor characteristics significant in stepwise all-time subset multiple linear regression model for nonsense mutations

Full size table

Prediction of frameshift mutations

Table 9 shows predictors that remained significant in the multiple linear regression model for FS mutations. Factor size was the well-nigh significant predictor followed by the nucleotide diversity (negative association) and the percentages of "A" and "C" nucleotides that were positively associated with the number of FS mutations in the gene. The R2 of the model for FS mutations was 0.23. Additional file 9 shows the human relationship betwixt the predicted and the observed numbers of FS mutations.

Table nine Gene characteristics meaning in stepwise all-time subset multiple linear regression model for frameshift mutations

Full size table

Prediction of the number of missense, nonsense and frameshift mutations together

Table ten shows predictors for missense, nonsense and frameshift mutations analyzed together. The gene size was the well-nigh significant predictor, followed by the nucleotide diversity (negative association) and the percentage of "A" and "C" nucleotides (positive associations). The R2 of the model for all mutations was 86%.

Table 10 Cistron characteristics significant in stepwise best subset multiple linear regression model for missense, nonsense, and frameshift mutations analyzed together

Full size table

Mutation type specific models

Nosotros tested how well the pan-mutation model works for predicting missense, nonsense and FS mutations separately. Nosotros compared them with mutation type specific models by the prediction accuracy. Riisouth were used to evaluate how well the model accounts for gene characteristics. R2s were computed by comparison of the observed and predicted number of mutations in the genes.

The pan-mutation model predicts missense mutations almost also every bit the missense-specific model described earlier: R2 = 0.86 vs R2 = 0.88. This is likely considering the majority of the mutations are missense mutations (88%) so when nosotros build a pan mutation model information technology is generally built for missense mutations. For nonsense mutations R2 for the pan-mutation was 0.34 while R2 for the nonsense-specific model was higher - Rii = 0.46. The type-specific model was also more authentic for frameshift mutations R2 = 0.22 versus R2 = 0.16. Therefore, the pan-mutation model works well for missense mutations, but for nonsense and frameshift mutations type-specific models perform better.

Boosted gene characteristics to ameliorate the prediction accuracy of MutsigCV

MutsigCV is one of the nigh popular and efficient tool for identification of cancer genes from mutation data [27]. MutsigCV predicts the number of mutations in a gene based on the gene size and the number of silent mutations detected in a given set up of tumor samples. Three other characteristics, "expr" – gene expression, "hic" – open chromatin and "reptime" – relative replication time are used equally co variates. We tested if the inclusion of additional gene characteristics could better prediction accuracy of MutsigCV. We used MutsigCV to identify cancer genes for analyses three dissimilar TCGA datasets: LUAD (Lung adenocarcinoma), LUSC (lung squamous jail cell carcinoma) and SKCM (pare cutaneous melanoma) with similar results. Here we prove the results generated by an analysis of LUAD information as an example. MutsigCV identified ten lung adenocarcinoma associated genes: KRAS, TP53, STK11, KEAP1, SMARCA4, EGFR, RBM10, C3orf27, ZNF831, and OR5M11. Stepwise multivariate mutation-specific regression models identified a partially overlapping set up of 21 cancer-associated genes: EGFR, TP53, KRAS, SI, STK11, FLG, PTPRD, COL11A1, LRP1B, FBN2, NEIL3, CSMD3, SPTA1, CDH10, PCLO, MYH1, USH2A, SPHKAP, ZNF804A, XIRP2, and ZNF831.

We tested if inclusion of boosted gene characteristics identified in our study improves the prediction accuracy of MutsigCV. The inclusion of the nucleotide limerick, the nucleotide variety, gene expression, and the replication fourth dimension only slightly improved Rtwo compared to the set of predictors used by MutsigCV: 0.lx versus 0.58. Hovewer, adding the number of silent mutations reported by genome wide screens in COSMIC led to substantial comeback in prediction efficacy: 0.66 vs 0.58. Similar results were obtained for LUSC and SKCM data. Therefore, incorporating the number of silent mutations reported by genome wide screens across different cancer types can significantly improve prediction accuracy of MutsigCV.

Genes with a higher than expected number of mutations (positive outliers)

We identified 111 positive outliers - genes with a significant backlog of missense, nonsense, or frameshift mutations, afterward the aligning for multiple testing (Boosted file ten). TP53 and PTEN have a higher than expected number of all three types of mutations. Five genes, ATM, LRP1B, CSMD3, FBXW, and SMAD4 accept an excess of missense and nonsense mutations. Three genes, COL11A1, SLC25A5, and PCLO evidence a pregnant backlog of frameshift and missense mutations. Twelve genes: APC, AXIN1, TET2, ASXL1, ARID2, RB1, NF1, VHL, PBRM1, KMT2D, KMT2C, and ARID1A, show an backlog of frameshift and nonsense mutations.

Z-scores for known cancer-associated genes

We computed Z-scores for known tumor suppressor genes (TS) and oncogenes (OGs) and compared them with Z-scores for other genes in the human genome. TS and OGs were defined past UniprotKB database [28, 29]. There are 233 OGs and 176 TSs. Genes that are not reported as TSs or OGs (other genes) were used every bit a reference group. The mean Z-score for known TSs was significantly higher for FS, missense, and nonsense mutations compared to Z-scores for all other genes. For known OGs the mean Z-score was higher for missense mutations only (Fig. iv). A higher Z-score for missense mutations is expected because typically activating missense mutations in oncogenes bulldoze tumorigenesis. [30, 31].

Fig. iv
figure 4

Z-scores for known tumor suppressor genes (TS), oncogenes (OG) and the genes that are not reported by UniprotKB every bit TS or OG – other genes. Z-scores for FS, missense (Mis.) and nonsense (Non.) mutations are shown separately. Vertical confined indicate the standard mistake of hateful

Full size prototype

Major findings

We plant that gene characteristics can explain considerable proportion of inter genic variation in the number of somatic mutations: 88% for missense, xl% for nonsense, and 23% for frameshift mutations. Many genes with a higher-than-expected number of mutations (positive outliers) were also identified. Over hundred positive outliers were non previously reported by the COSMIC cancer consensus database and therefore can be considered as novel candidate cancer genes.

Discussion

A goal of this report was to identify gene characteristics associated with the number of somatic mutations in tumor samples. Since gene characteristics nosotros used every bit predictors are inter-correlated, nosotros practical stepwise all-time subset regression model. Regression models explain 88% of variation in the number of missense, 40% nonsense, and 23% of frameshift mutations. If we presume that the unexplained variation in the number of mutations is due to an involvement of the cistron in cancer development, the results testify that FS near often associated with tumorigenesis followed by nonsense and missense mutations.

Each factor in the human genome acquires mutations on background level based on intrinsic mutability of the factor which depends on gene characteristics. Cancer associated genes are expected to take extra mutations due to option of clones with driver mutations. In our analysis positive outliers (genes with a higher-than-expected number of mutations) were considered every bit candidate cancer associated genes. The majority of outliers are known cancer-associated genes. We also identified a number of novel putative cancer-associated genes. We considered a cistron as a novel cancer-associated gene when the post-obit three criteria were satisfied: the gene is not listed among (1) COSMIC cancer demography genes; (2) Mayo Clinic 50 gene cancer console [32] or (3) Foundation Medicine 315 cistron panel. We accept identified eighteen novel cancer-associated genes with an backlog of missense mutations: MUC4, CSMD3, FLG, USH2A, DNAH8, FAT4, MUC17, MUC16, SYNE1, COL11A1, RP1, SI, SACS, SLC25A5, DMD, DST, XIRP2, and PKHD1L1. We as well identified 67 genes with an backlog of FS and/or nonsense mutations: ACVR2A, SOX9, RPL22, CDCP2, CRIPAK, FAT1, BAX, BCL9L, SON, TTK, ZFP36L2, RBMX, XYLT2, USP35, WBP1, BMPR2, ZDBF2, MBD6, TCF7L2, PABPC3, ESRP1, ZC3H18, TDG, SLC23A2, JPH4, UBR5, PDS5B, IL32, BCL9, SYCP1, PRRT2, ROBO2, TEAD2, ZNF626, CASP8, RBM10, WNT16, PTCHD3, CD3G, RTKN2, PLEKHA6, AKAP7, DDX27, SEC63, ADNP, NKTR, NDUFC2, MANEA, SYNJ2, TMEM60, ARV1, LARP4B, PHACTR4, TBX3, HNRNPL, PRRG1, MCPH1, CEP290, MAP7D1, CCDC73, GPATCH4, TGIF1, FAM111B, CLOCK, SCLT1, HOXB3, and SRRT. A larger number of novel cancer-associated genes identified through the analyses of FS and nonsense mutilations compared to the analysis of missense mutations can be due to the fact that a large proportion of variation in number of mutation is due to factor involvement in cancer evolution.

For some genes in the human genome, the total number of missense mutations does not differ significantly from the expected number, hovewer, those mutations are amassed. For example, the observed number of missense mutations in AKT1 oncogene is 113. This does not differ significantly from the expected number of the mutations (70), Z(M) = 0.86. However, the bulk (86 out of 113) of the mutation counts are p.E17K mutation. If nosotros exclude p.E17K, in the reminder of the AKT1 factor the observed number of mutations is lower than expected: 27 observed versus 70 expected. The lower number of mutations in the rest of the gene may be due to the fact that well-nigh of the coding region (85%) is occupied past functional domains. Missense mutations in functional domains may be loss-of-role mutations and as a result are negatively selected in tumors. Because our modeling does not have into account the distribution of mutations within the coding region, information technology may miss cancer genes with a clustering of functional mutations but a similar number of observed and expected mutations.

Interestingly, many novel cancer-associated genes identified past the backlog of missense mutations are large genes with repetitive functional domains: LRP1B, CSMD3, FLG, USH2A and others. In these genes functional mutations tend to be uniformly distributed across repetitive functional domains. For case, ane of the frequent mutations in CSMD3 gene is G > A substitution. It leads to arginine (R) to glutamine (Q) substitution. The mutation is reported at position 11 of the repetitive sushi domain: sushi domain #v (2 mutations), sushi domain #7 (4 mutations), sushi domain #9 (vii mutations), and sushi domain #13 (6 mutations). Taking into account that 92% of mutations in the gene are singletons, the observed pattern is likely to reflect the beingness of multiple peaks distributed across repetitive functional domains.

Nosotros establish that a small number of factor characteristics predict a large part of variation in the number of mutations per gene. "Number of silent mutations in the cistron" alone explains 84.iii% of variation in the number of missense mutations per gene. Adding "Percentage of "C"" and "Nucleotide diversity" improves prediction accuracy to 85.7 and 85.8% correspondingly. Adding last four predictors listed in Table 8 increases R2 from 85.7 to 88.one%. Therefore, the starting time three predictors explain most of the variation in the number of missense mutations per cistron.

For nonsense mutations, the number of potential sites for nonsense substitutions alone explains 34.7% of variation. Adding the number of silent mutations in the gene as a predictor increases R2 to 37.4%. Adding the gene size every bit a predictor further increases Rii to 39.four%. Including all meaning predictors listed in Tabular array eight makes Rtwo equal to 39.6%.

For frameshift mutations, the gene size alone explains 21.6% of variation. Calculation eight other significant predictors listed in Table 10 leads to only an incremental increase in R2 to 22.eight%.

We establish that the number of silent mutations reported past COSMIC genome wide screens across all cancer types is the most pregnant predictor of missense mutations. It as well contributed significantly to the prediction of nonsense equally well equally frameshift mutations. The number of silent mutations is the about important predictor of the number of somatic mutations in the gene because information technology is an integrative indicator of the background mutability of the gene.

The strongest predictor of nonsense mutations was the number of potential sites for that type of substitutions. Information technology explains 34.vii% of total variation. Only 21 out of possible 64 codons are capable of producing nonsense mutations by SNSs. The number of potential sites for nonsense mutation varies an society of magnitude across genes, from 0.03 per nucleotide for MUC21 to 0.29 for KRTAP20–1. The ability of the gene to generate nonsense mutations depends on codon composition.

We too found that the total number of silent mutations per gene reported by genome screens in COSMIC across unlike cancers improves the predicting accuracy of MutsigCV. MutsigCV uses the number of silent mutations in analyzed ready of tumor samples as a predictor. The number of silent mutations in a unmarried sample tends to accept a large variation because the typical sample size is small. Likewise dissimilar cancer types tend to have unlike mutation spectra (mutation signature). [33] An underestimation of the number of silent mutations in a sample can lead to false positives past MutsigCV but not by our analysis. In our analysis of LUAD data, MutsigCV identified "Chromosome 3 Open Reading Frame 27" (C3orf27) equally statistically significant with adjusted P-value of 0.02. The C3orf27 is an unexpected candidate: it is a small gene with no testify reported to appointment that it is cancer related. There are no reported silent mutations in the gene in LUAD sample which implies that the overall mutability of the gene is low suggesting not-silent mutations in the gene are cancer related. Based on COSMIC data, C3orf27 has a ratio of silent to non-silent mutations of 0.21, which does not differ significantly from the average ratio of 0.34. In our regression model C3orf27 was not significant. Therefore, the total number of silent mutations per gene generated by whole genome (exome) mutational screens beyond dissimilar cancer types is a key predictor of somatic mutations and needs to be included in cancer gene prediction models including MutsigCV to increase the specificity of the results.

We found that top predictors for missense, nonsense and FS mutations are different. Equally a result, the mutation-type specific prediction models work better for identification of cancer-associated genes compared to the pan-mutation model. Though the pan-mutation model performs acceptably in predicting the number of missense mutations, its prediction accurateness for nonsense and frameshift mutations is poor compared to the mutation-specific models.

Conclusions

Nosotros analyzed a number of gene characteristics associated with missense, nonsense, and frameshift mutations. We practical stepwise all-time subset multivariate model to predict missense, nonsense, and FS mutations using gene characteristics, and past comparison of the observed and expected number of mutations identified novel cancer-associated genes. Nosotros showed that including the total number of silent mutations per gene identified by whole genome/exome screens across different cancer types led to a substantial improvement in the prediction efficacy, indicating that this variable needs to be included in existing prediction algorithms, e.g. MutsigCV. We also generated a list of novel candidate cancer-associated genes that may warrant further analysis.

Abbreviations

CCLE:

Cancer Cell Line Encyclopedia

Catholic:

Catalog of Somatic Mutations in Cancer

FS:

Frameshift mutations

LUAD:

Lung adenocarcinoma

LUSC:

Lung squamous cell carcinoma

ND:

Nucleotide diversity

OG:

Oncogene

SKCM:

Skin cutaneous melanoma

SNS:

Single nucleotide exchange

TS:

Tumor suppressors

References

  1. Adzhubei IA, Schmidt S, Peshkin L, Ramensky VE, Gerasimova A, Bork P, Kondrashov Every bit, Sunyaev SR. A method and server for predicting damaging missense mutations. Nat Methods. 2010;7(4):248–nine.

    Article  CAS  Google Scholar

  2. Bromberg Y, Yachdav K, Rost B. SNAP predicts effect of mutations on protein role. Bioinformatics. 2008;24(xx):2397–viii.

    Article  CAS  Google Scholar

  3. Capriotti Due east, Fariselli P, Rossi I, Casadio R. A three-land prediction of unmarried signal mutations on poly peptide stability changes. BMC Bioinformatics. 2008;nine(Suppl 2):S6.

    Article  Google Scholar

  4. Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499(7457):214–viii.

    Commodity  CAS  Google Scholar

  5. Rogers MF, Shihab HA, Mort G, Cooper DN, Gaunt TR, Campbell C. FATHMM-XF: accurate prediction of pathogenic indicate mutations via extended features. Bioinformatics. 2018;34(3):511–3.

    Article  Google Scholar

  6. Schaefer C, Bromberg Y, Achten D, Rost B. Affliction-related mutations predicted to bear on protein function. BMC Genomics. 2012;13(Suppl iv):S11.

    Article  CAS  Google Scholar

  7. Wainreb G, Ashkenazy H, Bromberg Y, Starovolsky-Shitrit A, Haliloglu T, Ruppin E, Avraham KB, Rost B, Ben-Tal N. MuD: an interactive web server for the prediction of non-neutral substitutions using protein structural data. Nucleic Acids Res. 2010;38(Web Server result):W523–8.

    Article  CAS  Google Scholar

  8. Watson IR, Takahashi K, Futreal PA, Chin Fifty. Emerging patterns of somatic mutations in cancer. Nat Rev Genet. 2013;14(10):703–18.

    Article  CAS  Google Scholar

  9. Black RC, Khurshid H. NSCLC: an update of driver mutations, their role in pathogenesis and clinical significance. R I Med J (2013). 2015;98(10):25–8.

    Google Scholar

  10. Pon JR, Marra MA. Commuter and passenger mutations in cancer. Annu Rev Pathol. 2015;10:25–l.

    Article  CAS  Google Scholar

  11. Hodis E, Watson IR, Kryukov GV, Arold ST, Imielinski M, Theurillat JP, Nickerson Eastward, Auclair D, Li L, Place C, et al. A landscape of driver mutations in melanoma. Jail cell. 2012;150(2):251–63.

    Article  CAS  Google Scholar

  12. Park C, Qian W, Zhang J. Genomic prove for elevated mutation rates in highly expressed genes. EMBO Rep. 2012;13(12):1123-9.

    Article  CAS  Google Scholar

  13. Thurman RE, Rynes Due east, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC, Stergachis AB, Wang H, Vernot B, et al. The accessible chromatin mural of the human genome. Nature. 2012;489(7414):75–82.

    Article  CAS  Google Scholar

  14. Millar CB, Guy J, Sansom OJ, Selfridge J, MacDougall E, Hendrich B, Keightley PD, Bishop SM, Clarke AR, Bird A. Enhanced CpG mutability and tumorigenesis in MBD4-deficient mice. Science. 2002;297(5580):403–v.

    Article  CAS  Google Scholar

  15. Duret L, Mouchiroud D. Expression design and, surprisingly, factor length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc Natl Acad Sci U S A. 1999;96(8):4482–vii.

    Article  CAS  Google Scholar

  16. Muller CA, Nieduszynski CA. Dna replication timing influences gene expression level. J Cell Biol. 2017;216(vii):1907–fourteen.

    Commodity  CAS  Google Scholar

  17. Gorlova O, Fedorov A, Logothetis C, Amos C, Gorlov I. Genes with a large intronic burden prove greater evolutionary conservation on the protein level. BMC Evol Biol. 2014;xiv(1):l.

    Commodity  Google Scholar

  18. Martincorena I, Raine KM, Gerstung Chiliad, Dawson KJ, Haase One thousand, Van Loo P, Davies H, Stratton MR, Campbell PJ: Universal patterns of selection in Cancer and somatic tissues. Prison cell 2017, 171(5):1029–1041 e1021.

  19. Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, et al. The consensus coding sequence (CCDS) project: identifying a mutual protein-coding factor ready for the human and mouse genomes. Genome Res. 2009;19(7):1316–23.

    Commodity  CAS  Google Scholar

  20. Michaelson JJ, Shi Y, Gujral M, Zheng H, Malhotra D, Jin X, Jian Chiliad, Liu G, Greer D, Bhandari A, et al. Whole-genome sequencing in autism identifies hot spots for de novo germline mutation. Jail cell. 2012;151(7):1431–42.

    Article  CAS  Google Scholar

  21. Gorlov IP, Moore JH, Peng B, Jin JL, Gorlova OY, Amos CI. SNP characteristics predict replication success in association studies. Hum Genet. 2014;133(12):1477–86.

    Article  Google Scholar

  22. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, Wilson CJ, Lehar J, Kryukov GV, Sonkin D, et al. The Cancer jail cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483(7391):603–7.

    Article  CAS  Google Scholar

  23. Genomes Project C, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74.

    Article  Google Scholar

  24. Fernandez-Calero T, Cabrera-Cabrera F, Ehrlich R, Marin 1000. Silent polymorphisms: tin the tRNA population explicate changes in poly peptide properties? Life (Basel). 2016;6(ane).

    Article  Google Scholar

  25. Ryba T, Battaglia D, Chang BH, Shirley JW, Buckley Q, Pope BD, Devidas M, Druker BJ, Gilbert DM. Abnormal developmental control of replication-timing domains in pediatric astute lymphoblastic leukemia. Genome Res. 2012;22(ten):1833–44.

    Article  CAS  Google Scholar

  26. Sos BC, Fung HL, Gao DR, Osothprarop TF, Kia A, He MM, Zhang Thou. Characterization of chromatin accessibility with a transposome hypersensitive sites sequencing (THS-seq) assay. Genome Biol. 2016;17:20.

    Article  Google Scholar

  27. Tokheim CJ, Papadopoulos Due north, Kinzler KW, Vogelstein B, Karchin R. Evaluating the evaluation of cancer driver genes. Proc Natl Acad Sci U S A. 2016;113(50):14330–v.

    Article  CAS  Google Scholar

  28. Boutet E, Lieberherr D, Tognolli Thou, Schneider Thou, Bansal P, Bridge AJ, Poux Southward, Bougueleret L, Xenarios I. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to utilise the entry view. Methods Mol Biol. 2016;1374:23–54.

    Article  CAS  Google Scholar

  29. Breuza 50, Poux S, Estreicher A, Famiglietti ML, Magrane Grand, Tognolli M, Bridge A, Baratin D, Redaschi N, UniProt C. The UniProtKB guide to the man proteome. Database (Oxford). 2016;20(1):1-10.

  30. Alvarez-Cubero MJ, Martinez-Gonzalez LJ, Robles-Fernandez I, Martinez-Herrera J, Garcia-Rodriguez Thou, Pascual-Geler M, Cozar JM, Lorente JA. Somatic mutations in prostate Cancer: closer to personalized medicine. Mol Diagn Ther. 2016;21(ii):167-178.

    Commodity  Google Scholar

  31. Lipsyc M, Yaeger R. Impact of somatic mutations on patterns of metastasis in colorectal cancer. J Gastrointest Oncol. 2015;6(six):645–9.

    PubMed  PubMed Central  Google Scholar

  32. Rathi 5, Wright Thousand, Constantin D, Chang S, Pham H, Jones K, Palios A, McLachlan SA, Conron One thousand, McKelvie P, et al. Clinical validation of the 50 gene AmpliSeq Cancer panel V2 for apply on a side by side generation sequencing platform using formalin fixed, methane series embedded and fine needle aspiration tumour specimens. Pathology. 2017;49(1):75–82.

    Article  CAS  Google Scholar

  33. Alexandrov LB, Nik-Zainal S, Wedge DC, Aparicio SA, Behjati S, Biankin AV, Bignell GR, Bolli N, Borg A, Borresen-Dale AL, et al. Signatures of mutational processes in human cancer. Nature. 2013;500(7463):415–21.

    Article  CAS  Google Scholar

Download references

Acknowledgements

We would like to thank Dr. Chao Cheng for helpful suggestions.

Funding

This work was supported in part by the National Institutes of Health [U19 CA148127 to C.I.A., 1R01 CA149462 to O.Y.Thousand., K01LM012426 to H.R.F., P01 CA206980-01A1 subcontract to I.P.G., 1R56LM12371-01A1 to I.P.Yard.], Institutional Prouty Grant, Dartmouth Higher to I.O.K. and by the National Science Foundation [DMS-1361411 to Chiliad.Chiliad.]. The funders had no role in study design, information collection and analysis, conclusion to publish, or preparation of the manuscript.

Availability of data and materials

The datasets used and/or analyzed during the current report are available from the corresponding author on request.

Author information

Authors and Affiliations

Contributions

Conception and pattern: IG, CA, OG. Development of methodology: IG, CP, HF, SH, DWB, MK, OG, CA. Acquisition of information: IG, CP, SS, SH. Writing, review, and/or revision of the manuscript: IG, CA, OG, MK, MC. All authors read and approved the concluding manuscript.

Corresponding author

Correspondence to Ivan P. Gorlov.

Ethics declarations

Ideals approving and consent to participate

Not applicable.

Consent for publication

Non applicative.

Competing interests

The authors declare that they have no competing interests.

Publisher'south Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Boosted files

Additional file 1:

The relationship between the number of potential sites for a given type of mutations and the number of the mutations of the aforementioned type. Equally expected, there is a strong positive clan between the number of potential sites and the number of reported mutations similar to what was observed for the gene size. (DOCX 1370 kb)

Additional file ii:

Olfactory genes and densities of missense mutations. First row shows the proportion of olfactory genes in a sliding window of 100 genes, when it moves from everyman to highest nucleotide content. 2nd row shows the distribution of olfactory (red) and other (blue) genes across mutation densities bins. Olfactory genes take a higher density of missense but non nonsense mutations. Third row shows the outcome of excluding of olfactory genes on the relationship between percentage of "T" in the gene and density of missense mutations. (DOCX 2413 kb)

Boosted file 3:

The relationship between the proportion of CpG sites and the mutation densities. Proportion of CpGs was computed as the ratio of the number of CpGs in the cistron to the gene size in nucleotides. (DOCX 143 kb)

Additional file 4:

The relationship between conservation index and the mutation density. (DOCX 1044 kb)

Additional file 5:

The relationship between nucleotide diversity of the cistron sequences and the densities of somatic mutations. (DOCX 146 kb)

Boosted file 6:

The human relationship between chromatin accessibility and the mutation densities for missense, nonsense and frameshift mutations. (DOCX 130 kb)

Additional file vii:

The relationship between the observed and expected number of missense mutations. Each dot represents a cistron. (DOCX 566 kb)

Additional file 8:

The relationship between the observed and expected number of nonsense mutations. Each dot represents a cistron. (DOCX 654 kb)

Boosted file nine:

The human relationship between the observed and expected number of frameshift mutations. Each dot represents a gene. (DOCX 654 kb)

Boosted file 10:

Genes with a college than expected number of frameshift, missense, or nonsense mutations. Genes sorted on the maximum Z value. (DOCX 51 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Artistic Commons Attribution four.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give advisable credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zilch/1.0/) applies to the data fabricated available in this article, unless otherwise stated.

Reprints and Permissions

Well-nigh this article

Verify currency and authenticity via CrossMark

Cite this commodity

Gorlov, I.P., Pikielny, C.Westward., Frost, H.R. et al. Gene characteristics predicting missense, nonsense and frameshift mutations in tumor samples. BMC Bioinformatics 19, 430 (2018). https://doi.org/10.1186/s12859-018-2455-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI : https://doi.org/x.1186/s12859-018-2455-0

Keywords

  • Catalog of somatic mutations in Cancer
  • Catholic
  • Somatic mutations
  • Missense
  • Nonsense
  • Frameshift mutations
  • Cancer genes

Silent Missense And Nonsense Mutations,

Source: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2455-0

Posted by: handyowly1985.blogspot.com

0 Response to "Silent Missense And Nonsense Mutations"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel