The domestication of cultivated strawberry has followed a path different from that of other horticulturally important species, many of which were domesticated over millennia and traced to early civilizations, e.g., apple, olive, and wine grape. Although the octoploid progenitors were cultivated before the emergence of F. ananassa, the full extent of their cultivation is unclear and neither appears to have been intensely domesticated; e.g., Hardigan et al. did not observe changes in the genetic structure between land races and wild ecotypes of F. chiloensis, a species cultivated in Chile for at least 1,000 years. With less than 300 years of breeding, pedigrees for thousands of F. ananassa individuals have been recorded, albeit in disparate sources. To delve more deeply into the domestication history of cultivated strawberry, we assembled pedigree records from hundreds of sources and reconstructed the genealogy as deeply as possible. One of our initial motives for reconstructing the genealogy of cultivated strawberry was to identify historically important and genetically prominent ancestors of domesticated populations, in large part to guide the selection of individuals for whole-genome shotgun sequencing and DNA variant discovery, inform the development of single-nucleotide polymorphism genotyping platforms populated with octoploid genome-anchored subgenome-specific assays, and identify individuals for inclusion in genome-wide studies of biodiversity and population structure. The genetic relationships and genetic contributions of ancestors uncovered in the genealogy study described here guided the selection of individuals for downstream genomic studies that shed light on genetic variation and the genetic structure of domesticated populations worldwide.
Our other early motive for reconstructing the genealogy of strawberry was to support the curation and stewardship of a historically and commercially important germplasm collection preserved at the University of California, Davis (UCD), with accessions tracing to the early origins of the strawberry breeding program at the University of California, Berkeley, in the 1920s. We sought to develop a complete picture of genetic relationships among living and extinct individuals in the California and worldwide populations, in part to assess how extinct individuals relate to living individuals preserved in public germplasm collections. Because 80% or more of the individuals we documented in the genealogy appear to be extinct, they could only be connected to living individuals through their pedigrees. One of the ways we explored ancestral interconnections between extinct and living individuals was through multivariate analyses of a combined pedigree–genomic relationship matrix estimated from genotyped and ungenotyped individuals. The holdings and history of the UCD Strawberry Germplasm Collection were shrouded in mystery when our study was initiated in 2015. The immediate challenge we faced in reconstructing the genealogy was the absence of pedigree records for 96% of the 1,287 accessions preserved in the collection, which is hereafter identified as the “California” population. To solve this problem, authenticate pedigrees, and fully reconstruct the genealogy of the California population, we applied an exclusion analysis in combination with high-density SNP genotyping. Here, we demonstrate the exceptional accuracy of diploid paternity analysis methods when applied to individuals in an allooctoploid organism genotyped with subgenome-specific SNPs on high-density arrays.
Several thousand SNP markers common to the three arrays were integrated to develop a SNP profile database for the parentage analyses described here. SNPs on the 50-K and 850-K arrays are uniformly distributed across the octoploid genome and informative in octoploid populations worldwide. The 50-K SNP array harbors 1 SNP/16,200 bp, whereas the 850-K array harbors 1 SNP/953 bp, telomere-to-telomere across the 0.81-Gb octoploid genome.
The genealogies of domesticated plants, especially those with long-lived individuals, overlapping generations, and extensive migration and admixture, can be challenging to visualize and comprehend. We used Helium to visualize smaller targeted pedigrees; however, the strawberry pedigree networks we constructed and investigated were too large and mathematically complex to be effectively visualized and analyzed with Helium and other traditional hierarchical pedigree visualization approaches. Hierarchical methods often produce comprehensible insights and graphs when applied to pedigrees of individuals or small groups but yield exceedingly complex, labyrinthine graphs that are difficult to interpret when the genealogy contains a large number of individuals and lineages. We turned to social network analysis (SNA) to explore alternative approaches to search for patterns and extract information from the complex genealogy of strawberry. The pedigree networks of plants and animals share many of the features of social networks, with nodes connected to one another through edges (relationships). We used SNA methods, in combination with classic population genetic methods, to analyze the genealogy and develop deeper insights into the domestication history of strawberry. SNA approaches have been applied in diverse fields of study but have apparently not been applied to the problem of analyzing and characterizing pedigree networks. With SNA, narrative data are translated into relational data and summary statistics and visualized as sociograms.
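To make the network view of a pedigree concrete, the following minimal Python sketch treats individuals as nodes and parent-to-offspring relationships as directed edges. The toy records and the function names (`build_edges`, `founders`, `descendants`) are hypothetical illustrations, not the software or data used in this study.

```python
from collections import defaultdict, deque

# Hypothetical pedigree records: child -> (mother, father); None marks an unknown parent.
pedigree = {
    "C": ("A", "B"),
    "D": ("A", "B"),
    "E": ("C", None),
    "F": ("C", "D"),
}

def build_edges(pedigree):
    """Return parent -> set(children) for every known parent-offspring edge."""
    children = defaultdict(set)
    for child, parents in pedigree.items():
        for parent in parents:
            if parent is not None:
                children[parent].add(child)
    return children

def founders(pedigree):
    """Individuals that appear as parents but have no pedigree record of their own."""
    parents = {p for pair in pedigree.values() for p in pair if p is not None}
    return sorted(parents - set(pedigree))

def descendants(children, node):
    """All individuals reachable from `node` by following parent -> child edges."""
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

children = build_edges(pedigree)
# Out-degree (number of recorded offspring) is one simple node-level summary statistic.
out_degree = {node: len(kids) for node, kids in children.items()}
```

In this toy example, `founders(pedigree)` returns the two individuals without recorded parents, and `descendants(children, "A")` collects every individual tracing back to founder "A".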
Here, we report insights gained from genealogical studies of domesticated strawberry populations worldwide.
Our studies shed light on the complex wild ancestry of F. ananassa, the diversity of founders of domesticated populations of cultivated strawberry that have emerged over the past 300 years, and genetic relationships among extinct and extant ancestors in demographically unique domesticated populations tracing to the earliest ancestors and interspecific hybrids.
To develop a SNP profile database for DNA forensic and population genetic analyses, we recalled and reanalyzed SNP marker genotypes for 1,495 individuals, including 1,235 UCD and 260 USDA accessions previously genotyped by Hardigan et al. with the iStraw35 SNP array. SNP marker genotypes were automatically called with the Affymetrix Axiom Analysis Suite. DNA samples with >6% missing data were dropped from our analyses. This yielded 14,650 high-confidence codominant SNP markers for paternity–maternity analyses. While SNP markers are codominant by definition, a certain percentage of the SNP markers assayed in a population produce genotypic clusters lacking one of the homozygous genotypic clusters. These so-called no minor homozygote SNP markers were excluded from our analyses.
For a second DNA forensic analysis, 1,561 UCD individuals were genotyped with 50-K or 850-K SNP arrays. This study population included 560 hybrid offspring from crosses among 27 elite UCD parents, the F. ananassa cultivar “Puget Reliance,” and the F. chiloensis subsp. lucida ecotypes “Del Norte” and “Oso Flaco.” Hardigan et al. included 16,554 SNP markers from the iStraw35 and iStraw90 SNP arrays on the 850-K SNP array. To build a SNP profile database for the second paternity–maternity analysis, we identified 2,615 SNP markers that were common to the three arrays and produced well-separated codominant genotypic clusters with high confidence scores.
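The two quality-control filters described above (dropping samples with more than 6% missing calls and excluding markers that lack one of the homozygous genotype classes) can be sketched in Python as follows. This is an illustration under stated assumptions: genotypes are coded 0, 1, or 2 with `None` for a missing call, a marker is retained only if all three genotype classes are observed among the retained samples, and the function names are hypothetical. The actual calls in the study came from the Axiom Analysis Suite, whose cluster classification is not reproduced here.

```python
MISSING = None
MAX_MISSING = 0.06  # samples with more than 6% missing calls are dropped

def missing_rate(calls):
    """Fraction of loci with a missing genotype call for one sample."""
    return sum(c is MISSING for c in calls) / len(calls)

def filter_samples(genotypes, max_missing=MAX_MISSING):
    """Keep samples whose missing-call rate does not exceed the cutoff."""
    return {s: calls for s, calls in genotypes.items()
            if missing_rate(calls) <= max_missing}

def has_three_clusters(genotypes, locus):
    """True if both homozygotes (0, 2) and the heterozygote (1) are observed."""
    observed = {calls[locus] for calls in genotypes.values()} - {MISSING}
    return {0, 1, 2} <= observed

def filter_markers(genotypes):
    """Drop loci that lack any of the three genotype classes (assumed criterion)."""
    n_loci = len(next(iter(genotypes.values())))
    keep = [i for i in range(n_loci) if has_three_clusters(genotypes, i)]
    filtered = {s: [calls[i] for i in keep] for s, calls in genotypes.items()}
    return filtered, keep

# Hypothetical toy data: four samples genotyped at four loci.
toy = {
    "s1": [0, 1, 2, 0],
    "s2": [2, 1, 0, 1],
    "s3": [1, 0, 1, 1],
    "s4": [0, None, 2, 1],  # 25% missing, so this sample is dropped
}
samples = filter_samples(toy)
profiles, kept_loci = filter_markers(samples)
```

Samples are filtered before markers so that a low-quality sample cannot influence which genotype classes are counted as observed.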
We subdivided the global population into “California” and “Cosmopolitan” populations, in addition to continent-, region-, or country-specific populations, for different statistical analyses. These subdivisions are documented in the pedigree database. The California population included 100% of the UCD individuals from the global population, in addition to 262 non-California individuals that were ascendants of UCD individuals. The Cosmopolitan population included 100% of the non-California individuals, in addition to 160 California individuals that were ascendants of non-California individuals. We subdivided individuals in the US population into Midwestern, Northeastern, Southern, and Western US populations. The Western US population included only those UCD individuals that were ascendants in the pedigrees of Western US individuals. The country-specific subdivisions were Australia, China, Japan, South Korea, Belgium, Czechoslovakia, Denmark, England, Finland, France, Germany, Israel, Italy, the Netherlands, Norway, Poland, Russia, Scotland, Spain, Sweden, and Canada.
DTR and TTR statistics were estimated from their defining equations using custom R code that we developed and provided as supplemental material. DTR estimates for PO duos and TTR estimates for PPO trios were compared to empirically estimate statistical thresholds to exclude parents. With a perfect dataset, TTR = 0 when both parents in a trio are correctly identified. When estimated from a real-world dataset, TTR ≥ 0 even when both parents in the trio are correctly identified. However, TTR estimates for correctly identified parents are typically exceedingly small and approach zero when genotyping errors are small and DNA profiles are informative. The probability of a type I error depends on the genetic relatedness of individuals in the DNA profile database and the number, informativeness, and genotyping error rates of the DNA markers.
A false-positive error occurs when an individual that is not a parent is declared to be a parent, whereas a false-negative error occurs when an individual that is a parent is excluded. DTR and TTR thresholds for excluding parents were empirically estimated by bootstrapping.
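As an illustration of how transgression ratios behave, the Python sketch below computes duo and trio ratios for biallelic SNP genotypes coded 0, 1, or 2 (`None` for missing). The definitions are assumptions made for this sketch, not the equations or the R code from the study: a duo transgression is scored when parent and offspring are opposite homozygotes, a trio transgression is scored when the offspring genotype cannot arise from the two candidate parents under Mendelian inheritance, and each ratio is the number of transgressions divided by the number of loci compared.

```python
# Alleles a diploid genotype can transmit: 0 -> only allele 0, 1 -> either, 2 -> only allele 1.
GAMETES = {0: (0,), 1: (0, 1), 2: (1,)}

def dtr(parent, offspring):
    """Duo transgression ratio (assumed definition): share of opposite-homozygote loci."""
    compared = transgressions = 0
    for p, o in zip(parent, offspring):
        if p is None or o is None:
            continue  # skip loci with a missing call
        compared += 1
        if {p, o} == {0, 2}:
            transgressions += 1
    return transgressions / compared if compared else float("nan")

def ttr(parent1, parent2, offspring):
    """Trio transgression ratio (assumed definition): share of Mendelian-incompatible loci."""
    compared = transgressions = 0
    for p1, p2, o in zip(parent1, parent2, offspring):
        if p1 is None or p2 is None or o is None:
            continue
        compared += 1
        possible = {a + b for a in GAMETES[p1] for b in GAMETES[p2]}
        if o not in possible:
            transgressions += 1
    return transgressions / compared if compared else float("nan")

# Hypothetical six-locus profiles.
mother = [0, 2, 1, 0, 2, 1]
father = [2, 2, 1, 0, 0, 0]
child = [1, 2, 0, 0, 1, 1]
unrelated = [2, 0, 1, 2, 0, 2]
```

With these error-free toy profiles, the true parents give ratios of zero, whereas substituting the unrelated individual produces transgressions at two of the six loci. In practice, each estimate would be compared against an empirically estimated exclusion threshold rather than against zero.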
We drew 50,000 bootstrap samples from a population of 1,002 individuals with known pedigrees by replacing one or both parents in the known PPO trio with a randomly selected individual from the population. We built empirical DTR and TTR distributions from the 50,000 estimates and ascertained the statistical thresholds needed to accurately identify parents, exclude non-parents, and minimize false-positive errors. The bootstrap-estimated DTR threshold of 0.0016 yielded a false-positive probability of zero and a false-negative probability of 5%, whereas the bootstrap-estimated TTR threshold of 0.01 yielded false-positive and false-negative probabilities of zero. These thresholds were estimated by summing transgression scores over 14,650 SNP marker loci. To increase the computational speed and efficiency, DTR statistics were estimated for every PO combination, whereas TTR statistics were only estimated for PPO combinations where the DTR estimates for both parents were less than the empirical threshold. This was done because the number of PPO combinations was prohibitively large and most PPO combinations could be unequivocally excluded using DTR estimates.
We reconstructed the genealogy of cultivated strawberry as deeply as possible from wild founders to modern cultivars. To build the database, pedigree records for 8,851 individuals were assembled from more than 800 documents, including scientific and popular press articles, laboratory notebooks, garden catalogs, cultivar releases, plant patent databases, and germplasm repository databases. The database holds pedigree records and passport data for 2,656 F. ananassa cultivars, of which approximately 310 were private sector cultivars with pedigree records in public patent databases. The parents of the private sector cultivars, however, were nearly always identified by cryptic alphanumeric codes, and thus could not be integrated into the “giant component” of the sociogram.
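The notion of a “giant component” can be illustrated with a short Python sketch that groups individuals into connected components of the pedigree network using a union-find structure. The records, the coded parent names ("X17", "Y42"), and the function name are hypothetical. The sketch shows why a cultivar whose parents are known only by codes that match no other record ends up in a small component of its own.

```python
def connected_components(pedigree):
    """Group individuals linked by parent-offspring edges, largest component first."""
    root = {}

    def find(x):
        root.setdefault(x, x)
        while root[x] != x:
            root[x] = root[root[x]]  # path halving keeps lookups shallow
            x = root[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            root[ra] = rb

    for child, parents in pedigree.items():
        find(child)  # register individuals even if both parents are unknown
        for parent in parents:
            if parent is not None:
                union(child, parent)

    groups = {}
    for node in list(root):
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=len, reverse=True)

# Hypothetical records: child -> (mother, father).
pedigree = {
    "C": ("A", "B"),
    "D": ("C", "E"),
    "P1": ("X17", "Y42"),  # parents identified only by codes shared with no other record
    "Q": (None, None),     # no parents recorded
}
components = connected_components(pedigree)
giant = components[0]
```

Here the first four named individuals and their parents form the largest component, while "P1" and its coded parents form a separate, disconnected cluster.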
Our computer forensic search did not recover pedigree records for 220 individuals in the California population; however, we suspected that their parents might be present in the SNP profile database. Using duo and trio exclusion analyses, we identified both parents for 214 of these individuals and one parent each for the other six individuals. Hence, using a combination of computer and DNA forensic approaches, 2,222 out of 2,470 possible parents of 1,235 individuals in the California population were identified and documented in the pedigree database. The parents declared in pedigree records, identified by DNA forensic methods, or both are documented in the pedigree database. Despite their historic and economic importance, the pedigrees of individuals preserved in the UCD Strawberry Germplasm Collection had not been previously documented. Besides reconstructing the genealogy of the California population, previously undocumented pedigrees of extinct and extant individuals were discovered in the laboratory notebooks of Harold E. Thomas, Royce S. Bringhurst, and others, and integrated into the pedigree database. To further validate the accuracy of DNA forensic approaches for parent identification in strawberry, we applied an exclusion analysis to a population of 560 hybrid individuals developed from crosses among 30 UCD individuals. The pedigrees of the parents and hybrids were known. The parents and hybrids and 1,561 additional UCD individuals were genotyped with 50-K or 850-K SNP arrays. The 50-K array was developed with SNP markers from the 850-K array, which included a subset of 16,554 legacy SNP markers from the iStraw35 and iStraw90 arrays. We developed an integrated SNP profile database using 2,615 SNP markers common to the three arrays. Using PPO trios, we discovered that the SNP profile for one of the parents was a mismatch, whereas the SNP profiles of the other 29 parents perfectly matched their pedigree records.
We discovered that the parent stated for 11C151P008 was correct, but that the DNA sample and associated SNP marker profile were incorrect. Hence, the DNA sample mismatch was traced by trio exclusion analysis to a single easily corrected laboratory error. This analysis empirically demonstrated the utility of exclusion analysis for authenticating pedigrees and curating germplasm collections, and showed that parents can be accurately identified with substantially smaller numbers of DNA markers than those applied in our initial study.