Mutations underlying phenotypic variation remain elusive in trait-mapping studies1 despite the exponential accumulation of genomic data, suggesting that many causal variants are invisible to current genotyping approaches2,3,4,5. In fact, mutations like duplications, deletions, and transpositions6,7 are systematically under-represented by standard methods7, even as a consensus emerges that such structural variants (SVs) are important factors in the genetics of complex traits2. Addressing this problem requires compiling an accurate and complete catalog of the genomic features that are relevant to phenotypic variation, a goal most readily achieved by comparing nearly complete high-quality genomes7. Although the development of high-throughput short-read sequencing led to a steep drop in cost and a commensurate increase in the pace of sequencing8, it also led to a focus on single-nucleotide changes and small indels3,9. Paradoxically, this has also resulted in deterioration of the contiguity and completeness of new genome assemblies, due primarily to read-length limitations10.
Here we present a reference-quality assembly of a second D. melanogaster strain called A4 and introduce a comprehensive map of SVs, which identifies a large amount of hidden variation exceeding that due to SNPs and small indels, and which includes strong candidates to explain complex traits. The A4 strain is a part of the Drosophila Synthetic Population Resource (DSPR)11, a resource for mapping phenotypically relevant variants. We assembled the new A4 genome using high-coverage (147×) long reads through single-molecule real-time sequencing of DNA extracted from females (Supplementary Fig. 1), following an approach that has been shown to yield complete and contiguous assemblies12. The A4 assembly is more contiguous than release 6 of the ISO1 strain13—which is arguably the best metazoan whole-genome sequence assembly—with 50% of the genome contained in contiguous sequences (contigs) 22.3 Mb in length or longer (Supplementary Figs. 2 and 3). As compared to the ISO1 assembly, the A4 assembly comprises far fewer sequences (161 scaffolds versus 1,857 non-Y-chromosome scaffolds14) while maintaining comparable completeness (Supplementary Table 1)15. The two genomes are collinear across all major chromosome arms, making large-scale misassembly unlikely (Fig. 1a). An optical map of the A4 genome also supported the accuracy of the assembly (Supplementary Figs. 4 and 5).
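The contiguity statistic quoted above (50% of the genome contained in contigs of a given length or longer) is commonly called N50. A minimal sketch of how such a value can be computed is given below; the function name and the toy contig lengths are illustrative assumptions and are not taken from the A4 or ISO1 assemblies.

```python
def n50(lengths):
    """Return the contig length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig to the shortest, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_lengths = [10, 8, 5, 3, 2, 2]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

With these toy lengths the two longest contigs (10 + 8 = 18) already cover at least half of the 30 total units, so the function returns the length of the second contig.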
We identified putative SVs by classifying regions of disagreement in a genome-wide pairwise alignment of the A4 and ISO1 assemblies as indels, copy number variants (CNVs), or inversions (Table 1). Reads spanning SVs showed that genotyping error was rare (<2.5%; Supplementary Table 2). However, because extremely long repeats are common in heterochromatin and require specialized approaches for assembly and validation16, we focused on euchromatin (Supplementary Table 3). We discovered 1,890 large (>100-bp) indels (Supplementary Fig. 6 and Supplementary Table 4), which affected more than 7 Mb. In contrast, mutations <100 bp in length affected only 1.4 Mb (indels, 722 kb; SNPs, 687 kb). Among large indels, 79% (1,486/1,890) were transposable element (TE) insertions (Supplementary Figs. 7–17). A previously published catalog of TE insertions in A4 based on 70× short-read coverage17 failed to find 38% of the TE insertions in A4 reported here (Fig. 1b, Supplementary Fig. 18, and Supplementary Table 5). These insertions, which are invisible to short-read approaches, often occur (in 34% of instances) when a TE is inserted near another TE, resulting in complex, non-uniquely mapping reads that are difficult to interpret. One such insertion was found in the A4 allele of the MRP gene (encoding multidrug-resistance-like protein 1), which is a candidate gene for resistance to the chemotherapy drug carboplatin18 (Supplementary Fig. 17).
We found that many TE insertions affected introns (395/718 in ISO1, 435/768 in A4), often greatly lengthening them (Fig. 1c and Supplementary Fig. 19). Additionally, TEs inserted into exons can be spliced out, effectively becoming new introns. We saw evidence of this in cDNA from ISO119 and in RNA-seq reads in A4 that showed exon junctions flanking TE insertions (Supplementary Figs. 20–22 and Supplementary Table 6), which represents a genome-wide view of TE-derived introns segregating in a population. TE insertions within introns are associated with decreased transcription20, possibly caused by a phenomenon called intron delay, which slows transcription in long introns21. TE insertions can affect phenotype directly22, perhaps by modulating or disrupting the expression of important genes. Because most TEs are rare in D. melanogaster23, they are poorly tagged by common variants, complicating genome-wide association study (GWAS) approaches for mapping traits; this mirrors similar complications in human GWAS24.
Non-TE insertions represented 20% of ISO1 and 23% of A4 insertions, and they accounted for 170 kb of sequence variation (Fig. 1d and Table 1). Although these mutations were much smaller than TEs (median 213 bp versus 4.7 kb), they often affected genes, and 23% even escaped detection by short reads (Fig. 1b). For example, among both hidden and visible deletions, there were 18 genes that were present in ISO1 and partially or completely absent in A4 (Supplementary Table 7), including Cyp6a17 (Fig. 2a and Supplementary Fig. 23). Knockout of Cyp6a17 in a previous study increased cold preference25. Indeed, A4 flies preferred colder temperatures than flies from a strain carrying an intact copy of Cyp6a17 (Fig. 2b and Supplementary Fig. 24). Furthermore, this mutation was more common than expected for a deleterious allele (Fig. 2c), suggesting that it has a role in regulating how flies respond to temperature in the wild. One deletion missed by short-read genotyping removed the second exon of Mur18B (and 41 amino acids of the encoded chitin-binding protein that confers resistance to high-temperature stress26) (Supplementary Fig. 25), likely rendering the A4 Mur18B allele defective.
We discovered 27 inversions, ranging from 100 bp to 21 kb in length (Supplementary Table 4), that affected 60 kb of sequence, only 4 of which were detected by paired-end methods (Fig. 1b and Supplementary Table 5). These inversions often (in 21/27 instances) affected regions harboring genes, including a 21-kb region that spanned five genes encoding gustatory receptors: Gr22a, Gr22b, Gr22c, Gr22d, and Gr22e (Supplementary Table 4). Although such clusters of related sequences may obscure the read-mapping information used to detect inversions, we could not find genomic features that might explain why the other inversions were missed. The A4 optical map identified a putative inversion occupying 300 kb of the proximal end of the X-chromosome scaffold that was not resolved by the A4 assembly (Supplementary Figs. 4 and 5). Failure to resolve this inversion is not unexpected because assembly methods tuned for euchromatin perform poorly in heterochromatic regions16.
We discovered 390 CNVs (209 in A4 and 181 in ISO1) that affected ~600 kb (Fig. 1d, Supplementary Figs. 26–36, and Supplementary Table 4). Although some CNVs were missed by paired-end methods owing to spacer sequences between copies that were longer than the library fragments (Fig. 3a,d), most (~90%) of the CNVs were missed because they occurred in complex tandem repeats (Supplementary Fig. 37). Unlike indels, most CNVs (64%) affected exons. Additionally, short-read CNV genotyping methods missed 13 of 34 protein-coding genes that were duplicated in A4. In total, only ~40% of CNVs were discoverable with high-specificity split-read and read-orientation methods27,28 (Fig. 1b and Supplementary Fig. 38). Consistent with previous observations29, coverage-based methods were extremely nonspecific (Supplementary Fig. 38) and were therefore excluded from analysis. We next compared published gene expression data from larvae of A4 to expression data for a DSPR strain called A330 and identified 17 A4 duplicate genes that are single copy in ISO1 with increased expression (Supplementary Table 8), including genes previously identified as candidates for cold adaptation, olfactory response, and toxin resistance, among others (Fig. 3a,d and Supplementary Tables 8 and 9). Notably, eight of these CNVs were invisible to short-read methods (Supplementary Table 8).
A longstanding concern in trait-mapping studies is failure to genotype candidate mutations2. Because A4 is a parental line of the DSPR trait-mapping panel11, we could confront this problem directly. Among the eight duplicate genes with increased expression in A4 that escaped detection, Cyp28d1 and Ugt86Dh fell under quantitative trait loci (QTLs) for resistance to nicotine, a plant defense toxin30,31. One QTL (Q1) contains two genes, Cyp28d1 and Cyp28d2, that encode cytochrome P450 enzymes, both of which were upregulated30. The other candidate region that showed a major effect contains the Ugt86D gene cluster, which includes several differentially regulated genes, including Ugt86Dh (Fig. 3d,e). Candidate mutations like these are of obvious interest to researchers trying to dissect any trait, and yet they were not visible in the initial study30.
In the A4 assembly, Q1 contains a 3,755-bp tandem duplication in which the duplicated regions are separated by a 1.5-kb spacer, resulting in two copies of Cyp28d1 (Fig. 3a and Supplementary Figs. 39–41). We compared paralog-specific expression levels of the Cyp28d1 copies in A4 to expression of the single copy in A3. In the absence of nicotine, the proximal and distal copies in A4 exhibited ~41-fold and ~6.3-fold higher expression, respectively, than the single copy in A3 (Fig. 3b). The intervening spacer sequence proved to be the 5′ end of Accord, a long terminal repeat (LTR) retrotransposon (Fig. 3a). Insertion of Accord upstream of another gene called Cyp6g1 has been linked to upregulation of the encoded cytochrome P450 enzyme32, suggesting that the retrotransposon may be responsible for the upregulated expression rather than the tandem duplication of the Cyp28d gene. The second nicotine-resistance QTL contains several Ugt genes, including Ugt86Dh, which have previously been implicated in increased resistance to the pesticide DDT33. Of note, we found that Ugt86Dh was duplicated in A4 (Fig. 3d and Supplementary Figs. 42 and 43); this mutation escaped detection by paired-end short reads (Supplementary Table 5). Although several Ugt genes in the Q4 QTL showed higher expression in nicotine-resistant A4 larvae than in sensitive A3 larvae30 (Fig. 3e), candidate variants that explain these differences have yet to be identified.
Because nicotine analogs are widely used pesticides, we predict that resistance-conferring mutations are common, mirroring observations for DDT. Indeed, we found that four duplicate alleles spanning Cyp28d1 and Cyp28d2 segregated at intermediate to high frequencies in multiple populations (Fig. 3c) in a 25-kb region where we expected duplicate heterozygosity to be less than 0.1. Similarly, the single duplicate allele of Ugt86Dh segregated at high or intermediate frequency in nearly all of the populations we examined6 (Fig. 3f). Finally, patterns of SNP variation surrounding both Cyp28d1 and Ugt86Dh are consistent with recent bouts of natural selection (Supplementary Figs. 44 and 45), suggesting recent adaptation to nicotinoids.
Although we focus on genetic variation in A4 relative to ISO1, there is no biologically meaningful sense in which any individual of a species is a more appropriate reference than another. Yet, despite the prevalence of heritable phenotypic variation, functional work often describes results derived from individuals with diverse genotypes as applying to an entire species34. Approaches like RNA interference (RNAi) or gene editing with CRISPR require precise sequence information about their targets and can be easily misled by hidden structural variation. One study on the origin of new genes in D. melanogaster argues that new genes rapidly become essential, and the authors even report a new gene called p24-2 that is so young that it is present in only D. melanogaster35. Experiments targeting p24-2 using RNAi constructs suggested that, although new, p24-2 is essential. However, p24-2 was absent in eight of the ten strains we examined, including A4 and Oregon-R (Supplementary Figs. 46 and 47), which calls into question its essential nature in D. melanogaster. Because the original construct actually targeted both p24-2 and its essential paralog eca36,37 (Supplementary Note), we tested two other constructs targeting p24-2, neither of which resulted in any reduction in viability (Supplementary Table 10), thus bolstering the suggestion that p24-2 is not essential.
The ubiquity of hidden variation in genome structure is merely an indication of the extent of the underlying genetic variation governing phenotypes. Together with careful phenotypic measurements, a new generation of high-quality genomes will identify previously invisible heritable phenotypic variation. Our results show that popular genotyping approaches miss a significant fraction of SVs (Fig. 1b, Supplementary Figs. 18 and 38, and Supplementary Table 5), including ones that affect gene expression and organismal phenotype (Supplementary Tables 8 and 9), suggesting that previous estimates of the contribution of SVs to regulatory38 and phenotypic variation are misleading39. The extensive hidden variation we observe segregates in D. melanogaster, a species that likely harbors fewer complex structural features than humans or livestock, as well as crop species like wheat and maize. Consequently, we suggest that the true medical and agricultural impact of structural variation is likely to be much greater than the already considerable estimates made without recourse to multiple reference-grade assemblies29.
The 15:1 phenotypic ratio resulting from the self cross suggests duplicate dominant genes without cumulative effect.
In other words, there are two genes, each showing complete dominance, whose dominant alleles independently produce the same phenotype. As far as phenotype is concerned, a dominant allele at one locus, at the other locus, or at both loci all look the same.
So, in the self cross:
AaBb x AaBb =>
9 A_B_ double dominance = dominance
3 A_bb dominance from A
3 aaB_ dominance from B
1 aabb recessive
So that's where you get your 15:1 dominant:recessive phenotypic ratio.
Now, with a testcross, you were on the right track:
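AaBb x aabb =>
1 AaBb double dominance = dominance
1 Aabb dominance from A
1 aaBb dominance from B
1 aabb recessive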
So, comparing this to the selfing results above, you'd get 3 dominant : 1 recessive for your phenotypic ratio.
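If you want to check both ratios by brute force, here is a minimal sketch that enumerates every gamete combination. The function names are just ones I made up for illustration, and it assumes the two loci assort independently and that any dominant allele at either locus gives the dominant phenotype (the duplicate dominant model described above).

```python
from collections import Counter
from itertools import product


def gametes(genotype):
    """All equally likely gametes for a two-locus genotype such as 'AaBb'."""
    locus_a, locus_b = genotype[:2], genotype[2:]
    return [a + b for a, b in product(locus_a, locus_b)]


def phenotype_counts(parent1, parent2):
    """Count dominant vs recessive offspring under duplicate dominant genes."""
    counts = Counter()
    for g1, g2 in product(gametes(parent1), gametes(parent2)):
        alleles = g1 + g2
        # Assumption: one dominant allele at either locus is enough.
        has_dominant = "A" in alleles or "B" in alleles
        counts["dominant" if has_dominant else "recessive"] += 1
    return counts


self_cross = phenotype_counts("AaBb", "AaBb")
test_cross = phenotype_counts("AaBb", "aabb")
print(self_cross)
print(test_cross)
```

The counts are out of 16 equally likely gamete pairings in both cases, so the testcross counts reduce to the 3:1 ratio once you divide through.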
answered Oct 26 '16 at 14:53