The boxplots show divergence time estimates (posterior medians) for SARS-CoV-2 (red) and the 20022003 SARS-CoV virus (blue) from their most closely related bat virus. 87, 62706282 (2013). & Bedford, T. MERS-CoV spillover at the camelhuman interface. Another similarity between SARS-CoV and SARS-CoV-2 is their divergence time (4070years ago) from currently known extant bat virus lineages (Fig. The command line tool is open source software available under the GNU General Public License v3.0. Add entries for pangolin-data/-assignment 1.18.1.1 (, Really add a document on testing strategy. Software package for assigning SARS-CoV-2 genome sequences to global lineages. Nevertheless, the viral population is largely spatially structured according to provinces in the south and southeast on one lineage, and provinces in the centre, east and northeast on another (Fig. While such models have recently been made available, we lack the information to calibrate the rate decline over time (for example, through internal node calibrations44). Biazzo et al. Lu, R. et al. Press, H.) 3964 (Springer, 2009). SARS-CoV-2 and RaTG13 are also exceptions because they were sampled from Hubei and Yunnan, respectively. Bayesian evaluation of temporal signal in measurably evolving populations. Google Scholar. Evol. D.L.R. Among the 68sequences in the aligned sarbecovirus sequence set, 67 show evidence of mosaicism (all DunnSidak-corrected P<4104 and 3SEQ14), indicating involvement in homologous recombination either directly with identifiable parentals or in their deeper shared evolutionary historythat is, due to shared ancestral recombination events. performed recombination analysis for non-recombining regions1 and 2, breakpoint analysis and phylogenetic inference on recombinant segments. To employ phylogenetic dating methods, recombinant regions of a 68-genome sarbecovirus alignment were removed with three independent methods. Yuan, J. et al. PANGOLIN lineage database (15, 16) was used to analyze the frequency of lineages among countries. Center for Infectious Disease Dynamics, Department of Biology, Pennsylvania State University, University Park, PA, USA, Department of Microbiology, Immunology and Transplantation, KU Leuven, Rega Institute, Leuven, Belgium, Department of Biological Sciences, Xian Jiaotong-Liverpool University, Suzhou, China, State Key Laboratory of Emerging Infectious Diseases, School of Public Health, The University of Hong Kong, Hong Kong SAR, China, Department of Biology, University of Texas Arlington, Arlington, TX, USA, Institute of Evolutionary Biology, University of Edinburgh, Edinburgh, UK, MRC-University of Glasgow Centre for Virus Research, Glasgow, UK, You can also search for this author in https://doi.org/10.1038/s41564-020-0771-4, DOI: https://doi.org/10.1038/s41564-020-0771-4. R. Soc. Katoh, K., Asimenos, G. & Toh, H. in Bioinformatics for DNA Sequence Analysis (ed. 6, e14 (2017). 36)gives a putative recombination-free alignment that we call non-recombinant alignment3 (NRA3) (see Methods). PLoS ONE 5, e10434 (2010). Identifying the origins of an emerging pathogen can be critical during the early stages of an outbreak, because it may allow for containment measures to be precisely targeted at a stage when the number of daily new infections is still low. This long divergence period suggests there are unsampled virus lineages circulating in horseshoe bats that have zoonotic potential due to the ancestral position of the human-adapted contact residues in the SARS-CoV-2 RBD. A deep dive into the genetics of the novel coronavirus shows it seems to have spent some time infecting both bats and pangolins before it jumped into humans, researchers said . Our third approach involved identifying breakpoints and masking minor recombinant regions (with gaps, which are treated as unobserved characters in probabilistic phylogenetic approaches). In our analyses of the sarbecovirus datasets, we incorporated the uncertainty of the sampling dates when exact dates were not available. 3). All three approaches to removal of recombinant genomic segments point to a single ancestral lineage for SARS-CoV-2 and RaTG13. The ongoing pandemic spread of a new human coronavirus, SARS-CoV-2, which is associated with severe pneumonia/disease (COVID-19), has resulted in the generation of tens of thousands of virus genome sequences. Lemey, P., Minin, V. N., Bielejec, F., Pond, S. L. K. & Suchard, M. A. b, Similarity plot between SARS-CoV-2 and several selected sequences including RaTG13 (black), SARS-CoV (pink) and two pangolin sequences (orange). Transparent bands of interquartile range width and with the same colours are superimposed to highlight the overlap between estimates. Current sampling of pangolins does not implicate them as an intermediate host. & Holmes, E. C. A genomic perspective on the origin and emergence of SARS-CoV-2. It is RaTG13 that is more divergent in the variable-loop region (Extended Data Fig. 2). The consistency of the posterior rates for the different prior means also implies that the data do contribute to the evolutionary rate estimate, despite the fact that a temporal signal was visually not apparent (Extended Data Fig. Nature 583, 286289 (2020). The coverage threshold and consensus sequence generation threshold were set to 20 and 90 respectively. 2). On first examination this would suggest that that SARS-CoV-2 is a recombinant of an ancestor of Pangolin-2019 and RaTG13, as proposed by others11,22. Evolutionary rate estimation can be profoundly affected by the presence of recombination50. 17, 15781579 (1999). 4. Microbiol. Our results indicate the presence of a single lineage circulating in bats with properties that allowed it to infect human cells, as previously described for bat sarbecoviruses related to the first SARS-CoV lineage29,30,31. In addition, sequences NC_014470 (Bulgaria 2008), CoVZXC21, CoVZC45 and DQ412042 (Hubei-Yichang) needed to be removed to maintain a clean non-recombinant signal in A. This provides compelling support for the SARS-CoV-2 lineage being the consequence of a direct or nearly-direct zoonotic jump from bats, because the key ACE2-binding residues were present in viruses circulating in bats. The variable-loop region in SARS-CoV-2 shows closer identity to the 2019 pangolin coronavirus sequence than to the RaTG13 bat virus, supported by phylogenetic inference (Fig. EPI_ISL_410721) and Beijing Institute of Microbiology and Epidemiology (W.-C. Cao, T.T.-Y.L., N. Jia, Y.-W. Zhang, J.-F. Jiang and B.-G. Jiang, nos. Scientists trying to trace the ancestry of SARS-CoV-2, the virus responsible for COVID-19, have found the pangolin is unlikely to be the source of the virus responsible for the current pandemic. RegionB is 5,525nt long. Hon, C. et al. Virus Evol. PubMed Central ac, Root-to-tip (RtT) divergence as a function of sampling time for the three coronavirus evolutionary histories unfolding over different timescales (HCoV-OC43 (n=37; a) MERS (n=35; b) and SARS (n=69; c)). Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia. CAS Hu, B. et al. Wan, Y., Shang, J., Graham, R., Baric, R. & Li, F. Receptor recognition by the novel Coronavirus from Wuhan: an analysis based on decade-long structural studies of SARS coronavirus. We extracted a similar number (n=35) of genomes from a MERS-CoV dataset analysed by Dudas et al.59 using the phylogenetic diversity analyser tool60 (v.0.5). PubMed Central PI signals were identified (with bootstrap support >80%) for seven of these eight breakpoints: positions 1,684, 3,046, 9,237, 11,885, 21,753, 22,773 and 24,628. 92, 433440 (2020). The inset represents divergence time estimates based on NRR1, NRR2 and NRA3. Conducting analogous analyses of codon usage bias as Ji et al. The shaded region corresponds to the Sprotein. From this perspective, it may be useful to perform surveillance for more closely related viruses to SARS-CoV-2 along the gradient from Yunnan to Hubei. Nature 579, 265269 (2020). Syst. Evol. The histogram allows for the identification of non-recombining regions (NRRs) by revealing regions with no breakpoints. The first available sequence data6 placed this novel human pathogen in the Sarbecovirus subgenus of Coronaviridae7, the same subgenus as the SARS virus that caused a global outbreak of >8,000 cases in 20022003. 31922087). GARD identified eight breakpoints that were also within 50nt of those identified by 3SEQ. Holmes, E. C., Rambaut, A. Biol. Phylogenetic trees and exact breakpoints for all ten BFRs are shown in Supplementary Figs. Bioinformatics 30, 13121313 (2014). In such cases, even moderate rate variation among long, deep phylogenetic branches will substantially impact expected root-to-tip divergences over a sampling time range that represents only a small fraction of the evolutionary history40. The Artic Network receives funding from the Wellcome Trust through project no. It allows a user to assign a SARS-CoV-2 genome sequence the most likely lineage (Pango lineage) to SARS-CoV-2 query sequences. 21, 255265 (2004). Sequences were aligned by MAFTT58 v.7.310, with a final alignment length of 30,927, and used in the analyses below. All custom code used in the manuscript is available at https://github.com/plemey/SARSCoV2origins. Because 3SEQ is the most statistically powerful of the mosaic methods61, we used it to identify the best-supported breakpoint history for each potential child (recombinant) sequence in the dataset. Temporal signal was tested using a recently developed marginal likelihood estimation procedure41 (Supplementary Table 1). Google Scholar. PubMed Central In the absence of any reasonable prior knowledge on the TMRCA of the sarbecovirus datasets (which is required for grid specification in a skygrid model), we specified a simpler constant size population prior. 26, 450452 (2020). Unlike other viruses that have emerged in the past two decades, coronaviruses are highly recombinogenic14,15,16. 3) to examine the sensitivity of date estimates to this prior specification. Ge, X. et al. We used an uncorrelated relaxed clock model with log-normal distribution for all datasets, except for the low-diversity SARS data for which we specified a strict molecular clock model. Divergence time estimates based on the three regions/alignments where the effects of recombination have been removed. Removal of five sequences that appear to be recombinants and two small subregions of BFRA was necessary to ensure that there were no phylogenetic incongruence signals among or within the three BFRs. Dudas, G., Carvalho, L. M., Rambaut, A. Proc. Curr. Are pangolins the intermediate host of the 2019 novel coronavirus (SARS-CoV-2)? 30, 21962203 (2020). Epidemiology, genetic recombination, and pathogenesis of coronaviruses. 04:20. However, for several reasons, nucleotide sequences may be generated that cover only the spike gene of SARS-CoV-2. Combining regions A, B and C and removing the five named sequences gives us putative NRR1, as an alignment of 63sequences. Membrebe, J. V., Suchard, M. A., Rambaut, A., Baele, G. & Lemey, P. Bayesian inference of evolutionary histories under time-dependent substitution rates. Concurrent evidence also proposed pangolins as a potential intermediate species for SARS-CoV-2 emergence and suggested them as a potential reservoir species11,12,13. Nature 583, 282285 (2020). This boundary appears to be rarely crossed. Anderson, K. G. nCoV-2019 codon usage and reservoir (not snakes v2). Cell 181, 223227 (2020). Of importance for future spillover events is the appreciation that SARS-CoV-2 has emerged from the same horseshoe bat subgenus that harbours SARS-like coronaviruses. The existing diversity and dynamic process of recombination amongst lineages in the bat reservoir demonstrate how difficult it will be to identify viruses with potential to cause major human outbreaks before they emerge. J. Med. Genetics 172, 26652681 (2006). A new coronavirus associated with human respiratory disease in China. The relatively fast evolutionary rate means that it is most appropriate to estimate shallow nodes in the sarbecovirus evolutionary history. TMRCA estimates for SARS-CoV-2 and SARS-CoV from their respective most closely related bat lineages are reasonably consistent for the different data sets and different rate priors in our analyses. Viral metagenomics revealed Sendai virus and coronavirus infection of Malayan pangolins (Manis javanica). Open reading frames are shown above the breakpoint plot, with the variable-loop region indicated in the Sprotein. Evol. Provided by the Springer Nature SharedIt content-sharing initiative, Molecular and Cellular Biochemistry (2023), Nature Microbiology (Nat Microbiol) Specifically, using a formal Bayesian approach42 (see Methods), we estimate a fast evolutionary rate (0.00169 substitutions per siteyr1, 95% highest posterior density (HPD) interval (0.00131,0.00205)) for SARS viruses sampled over a limited timescale (1year), a slower rate (0.00078 (0.00063,0.00092) substitutions per siteyr1) for MERS-CoV on a timescale of about 4years and the slowest rate (0.00024 (0.00019,0.00029) substitutions per siteyr1) for HCoV-OC43 over almost five decades. The red and blue boxplots represent the divergence time estimates for SARS-CoV-2 (red) and the 2002-2003 SARS-CoV (blue) from their most closely related bat virus, with the light- and dark-colored versions based on the HCoV-OC43 and MERS-CoV centered priors, respectively. Evolutionary origins of the SARS-CoV-2 sarbecovirus lineage responsible for the COVID-19 pandemic, https://doi.org/10.1038/s41564-020-0771-4. Zhou et al.2 concluded from the genetic proximity of SARS-CoV-2 to RaTG13 that a bat origin for the current COVID-19 outbreak is probable. is funded by the MRC (no. Evol. CAS SARS-CoV-2 genetic lineages in the United States are routinely monitored through epidemiological investigations, virus genetic sequence-based surveillance, and laboratory studies. 110. N. Engl. After removal of A1 and A4, we named the new region A. The pangolin coronaviruses show lower similarity to SARS-CoV-2 than bat coronavirus RaTG13 across the whole genome, but higher similarity in the spike receptor binding domain, although the similarity at either scale remains too low to implicate . Abstract. The divergence time estimates for SARS-CoV-2 and SARS-CoV from their respective most closely related bat lineages are reasonably consistent among the three approaches we use to eliminate the effects of recombination in the alignment. Genet. Because the estimated rates and divergence dates were highly similar in the three datasets analysed, we conclude that our estimates are robust to the method of identifying a genomes NRRs. As informative rate priors for the analysis of the sarbecovirus datasets, we used two different normal prior distributions: one with a mean of 0.00078 and s.d. 1, vev003 (2015). Press, 2009). . There is a 90% DNA match between SARS CoV 2 and a coronavirus in pangolins. performed Srecombination analysis. We thank A. Chan and A. Irving for helpful comments on the manuscript. with an alignment on which an initial recombination analysis was done. J. Virol. Boni, M.F., Lemey, P., Jiang, X. et al. 3). Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Published. Natl Acad. Divergence time estimates based on the HCoV-OC43-centred rate prior for the separate BFRs (Supplementary Table 3) show consistency in TMRCA estimates across the genome. However, on closer inspection, the relative divergences in the phylogenetic tree (Fig. RegionsAC had similar phylogenetic relationships among the southern China bat viruses (Yunnan, Guangxi and Guizhou provinces), the Hong Kong viruses, northern Chinese viruses (Jilin, Shanxi, Hebei and Henan provinces, including Shaanxi), pangolin viruses and the SARS-CoV-2 lineage. We thank originating laboratories at South China Agricultural University (Y. Shen, L. Xiao and W. Chen; no. To examine temporal signal in the sequenced data, we plotted root-to-tip divergence against sampling time using TempEst39 v.1.5.3 based on a maximum likelihood tree. Proc. Forni, D., Cagliani, R., Clerici, M. & Sironi, M. Molecular evolution of human coronavirus genomes. The SARS-CoV divergence times are somewhat earlier than dates previously estimated15 because previous estimates were obtained using a collection of SARS-CoV genomes from human and civet hosts (as well as a few closely related bat genomes), which implies that evolutionary rates were predominantly informed by the short-term SARS outbreak scale and probably biased upwards. 1) and thus likely to be the product of recombination, acquiring a divergent variable loop from a hitherto unsampled bat sarbecovirus28. Sliding window analysis of changes in the patterns of sequence similarity between human SARS-CoV-2, and pangolin and bat coronaviruses as described further in Fig. It compares the new genome against the large, diverse population of sequenced strains using a Alternatively, combining 3SEQ-inferred breakpoints, GARD-inferred breakpoints and the necessity of PI signals for inferring recombination, we can use the 9.9-kb region spanning nucleotides 11,88521,753 (NRR2) as a putative non-recombining region; this approach is breakpoint-conservative because it is conservative in identifying breakpoints but not conservative in identifying non-recombining regions. Boni, M. F., Zhou, Y., Taubenberger, J. K. & Holmes, E. C. Homologous recombination is very rare or absent in human influenza A virus. 6, 8391 (2015). And this genotype pattern led to creating a new Pangolin lineage named B.1.640.2, a phylogenetic sister group to the old B.1.640 lineage renamed B.1.640.1. M.F.B. Eden, J.-S., Tanaka, M. M., Boni, M. F., Rawlinson, W. D. & White, P. A. Recombination within the pandemic norovirus GII.4 lineage. A SARS-like cluster of circulating bat coronaviruses shows potential for human emergence. By 2009, however, rapid genomic analysis had become a routine component of outbreak response. Because there is no single accepted method of inferring breakpoints and identifying clean subregions with high certainty, we implemented several approaches to identifying three classic statistical signals of recombination: mosaicism, phylogenetic incongruence and excessive homoplasy51. In December 2019, a cluster of pneumonia cases epidemiologically linked to an open-air live animal market in the city of Wuhan (Hubei Province), China1,2 led local health officials to issue an epidemiological alert to the Chinese Center for Disease Control and Prevention and the World Health Organizations (WHO) China Country Office. Rambaut, A., Lam, T. T., Carvalho, L. M. & Pybus, O. G. Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). Biol. G066215N, G0D5117N and G0B9317N)) and by the European Unions Horizon 2020 project MOOD (no. It is available as a command line tool and a web application. Even before the COVID-19 pandemic, pangolins have been making headlines. All authors contributed to analyses and interpretations. Biol. Using the most conservative approach (NRR1), the divergence time estimate for SARS-CoV-2 and RaTG13 is 1969 (95% HPD: 19302000), while that between SARS-CoV and its most closely related bat sequence is 1962 (95% HPD: 19321988); see Fig. Lin, X. et al. . N. China corresponds to Jilin, Shanxi, Hebei and Henan provinces, and the N. China clade also includes one sequence sampled in Hubei Province in 2004. Thank you for visiting nature.com. Time-measured phylogenetic reconstruction was performed using a Bayesian approach implemented in BEAST42 v.1.10.4. Boxes show 95% HPD credible intervals. Bioinformatics 28, 32483256 (2012). Here, we analyse the evolutionary history of SARS-CoV-2 using available genomic data on sarbecoviruses. Wu, F. et al. J. Virol. 2, bottom) show that SARS-CoV-2 is unlikely to have acquired the variable loop from an ancestor of Pangolin-2019 because these two sequences are approximately 1015% divergent throughout the entire Sprotein (excluding the N-terminal domain). A., Filip, I., AlQuraishi, M. & Rabadan, R. Recombination and lineage-specific mutations led to the emergence of SARS-CoV-2. Coronavirus: Pangolins found to carry related strains. Get the most important science stories of the day, free in your inbox. Google Scholar. 82, 48074811 (2008). Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Duchene, S. et al. Concatenated region ABC is NRR1. As a proxy, it would be possible to model the long-term purifying selection dynamics as a major source of time-dependent rates43,44,52, but this is beyond the scope of the current study. To estimate non-synonymous over synonymous rate ratios for the concatenated coding genes, we used the empirical Bayes Renaissance countingprocedure67. 36, 7597 (2002). We aimed to analyze 3 naso-oropharyngeal swab samples collected between August and December 2021 to describe the amino acid changes present in the sequence reads that may have a role in the emergence of new . RegionC showed no PI signals within it. 5). MERS-CoV data were subsampled to match sample sizes with SARS-CoV and HCoV-OC43. Biol. Virus Evol. A novel bat coronavirus closely related to SARS-CoV-2 contains natural insertions at the S1/S2 cleavage site of the Spike protein. These rate priors are subsequently used in the Bayesian inference of posterior rates for NRR1, NRR2, and NRA3 as indicated by the solid arrows. Bryant, D. & Moulton, V. Neighbor-Net: an agglomerative method for the construction of phylogenetic networks. In the variable-loop region, RaTG13 diverges considerably with the TMRCA, now outside that of SARS-CoV-2 and the Pangolin Guangdong 2019 ancestor, suggesting that RaTG13 has acquired this region from a more divergent and undetected bat lineage. 2, vew007 (2016). & Andersen, K. G. The evolution of Ebola virus: insights from the 20132016 epidemic. DRAGEN COVID Lineage App This app aligns reads to a SARS-CoV-2 reference genome and reports coverage of targeted regions. Sorting these breakpoint-free regions (BFRs) by length results in two segments >5kb: an ORF1a subregion spanning nucleotides (nt) 3,6259,150 and the first half of ORF1b spanning nt13,29119,628 (sequence numbering given in Source Data, https://github.com/plemey/SARSCoV2origins). Two exceptions can be seen in the relatively close relationship of Hong Kong viruses to those from Zhejiang Province (with two of the latter, CoVZC45 and CoVZXC21, identified as recombinants) and a recombinant virus from Sichuan for which part of the genome (regionB of SC2018 in Fig. The fact that they are geographically relatively distant is in agreement with their somewhat distant TMRCA, because the spatial structure suggests that migration between their locations may be uncommon. We find that the sarbecovirusesthe viral subgenus containing SARS-CoV and SARS-CoV-2undergo frequent recombination and exhibit spatially structured genetic diversity on a regional scale in China. Means and 95% HPD intervals are 0.080 [0.0580.101] and 0.530 [0.3040.780] for the patristic distances between SARS-CoV-2 and RaTG13 (green) and 0.143 [0.1090.180] and 0.154 [0.0930.231] for the patristic distances between SARS-CoV-2 and Pangolin 2019 (orange). 4), that region and shorter BFRs were not included in combined putative non-recombinant regions. Microbiol. Extended Data Fig. 190, 20882095 (2004). Patino-Galindo, J. Download a free copy. At present, we analyzed the diversity of SARS-CoV-2 viral genomes in India to know the evolutionary patterns of viruses in the country through their pangolin lineage and GISAID-Clade. Coronavirus Disease 2019 (COVID-19) Situation Report 51 (World Health Organization, 2020). If the latter still identified non-negligible recombination signal, we removed additional genomes that were identified as major contributors to the remaining signal. The estimated divergence times for the pangolin virus most closely related to the SARS-CoV-2/RaTG13 lineage range from 1851 (1730-1958) to 1877 (1746-1986), indicating that these pangolin . As illustrated by the dashed arrows, these two posteriors motivate our specification of prior distributions with standard deviations inflated 10-fold (light color). Visual exploration using TempEst39 indicates that there is no evidence for temporal signal in these datasets (Extended Data Fig. We thank T. Bedford for providing M.F.B. We showed that severe acute respiratory syndrome coronavirus 2 is probably a novel recombinant virus. 36) (RDP, GENECONV, MaxChi, Bootscan, SisScan and 3SEQ) and considered recombination signals detected by more than two methods for breakpoint identification. BEAST inferences made use of the BEAGLE v.3 library68 for efficient likelihood computations. An initial genomic sequence analysis found that the reemergence of COVID-19 in New Zealand was caused by a SARS-CoV-2 from the (now ancestral) lineage B.1.1.1 of the pangolin nomenclature ( 17 ). Given what was known about the origins of SARS, as well as identification of SARS-like viruses circulating in bats that had binding sites adapted to human receptors29,30,31, appropriate measures should have been in place for immediate control of outbreaks of novel coronaviruses. Chernomor, O. et al. Maclean, O. Mol. Adv. This produced non-recombining alignment NRA3, which included 63 of the 68genomes. Anyone you share the following link with will be able to read this content: Sorry, a shareable link is not currently available for this article. Lie, P., Chen, W. & Chen, J.-P. Virological.org http://virological.org/t/ncovs-relationship-to-bat-coronaviruses-recombination-signals-no-snakes-no-evidence-the-2019-ncov-lineage-is-recombinant/331 (2020). c, Maximum likelihood phylogenetic trees rooted on a 2007 virus sampled in Kenya (BtKy72; root truncated from images), shown for five BFRs of the sarbecovirus alignment. 1 Phylogenetic relationships in the C-terminal domain (CTD). 5. Based on the identified breakpoints in each genome, only the major non-recombinant region is kept in each genome while other regions are masked. from the European Research Council under the European Unions Horizon 2020 research and innovation programme (grant agreement no. Alexandre Hassanin, Vuong Tan Tu, Gabor Csorba, Nicola F. Mller, Kathryn E. Kistler & Trevor Bedford, Jack M. Crook, Ivana Murphy, Diana Bell, Simon Pollett, Matthew A. Conte, Irina Maljkovic Berry, Yatish Turakhia, Bryan Thornlow, Russell Corbett-Detig, Nature Microbiology Holmes, E. C. The Evolution and Emergence of RNA Viruses (Oxford Univ. and X.J. According to GISAID . Nat. Because the SARS-CoV-2 S protein has been implicated in past recombination events or possibly convergent evolution12, we specifically investigated several subregions of the Sproteinthe N-terminal domain of S1, the C-terminal domain of S1, the variable-loop region of the C-terminal domain, and S2. Bioinformatics 22, 26882690 (2006). Using these breakpoints, the longest putative non-recombining segment (nt1,88521,753) is 9.9kb long, and we call this region NRR2. The unsampled diversity descended from the SARS-CoV-2/RaTG13 common ancestor forms a clade of bat sarbecoviruses with generalist propertieswith respect to their ability to infect a range of mammalian cellsthat facilitated its jump to humans and may do so again. Instead, similarity in codon usage metrics between the SARS-CoV-2 and eukaryotes analyzed was correlated with coding sequence GC content of the eukaryote, with more similar codon usage being identified in eukaryotes with low GC content similar to that of the coronavirus (b). Posterior means (horizontal bars) of patristic distances between SARS-CoV-2 and its closest bat and pangolin sequences, for the spike proteins variable loop region and CTD region excluding the variable loop. Google Scholar. Posterior rate distributions for MERS-CoV (far left) and HCoV-OC43 (far right) using BEAST on n=27 sequences spread over 4 years (MERS-CoV) and n=27 sequences spread over 49 years (HCoV-OC43). Lancet 395, 949950 (2020). Methods Ecol. Viruses 11, 174 (2019). Nature 558, 180182 (2018). Across a large region of the virus genome, corresponding approximately to ORF1b, it did not cluster with any of the known bat coronaviruses indicating that recombination probably played a role in the evolutionary history of these viruses5,7. Phylogenies of subregions of NRR1 depict an appreciable degree of spatial structuring of the bat sarbecovirus population across different regions (Fig. Mol. Genetics 176, 10351047 (2007). 4 we compare these divergence time estimates to those obtained using the MERS-CoV-centred rate priors for NRR1, NRR2 and NRA3. For the HCoV-OC43, MERS-CoV and SARS datasets we specified flexible skygrid coalescent tree priors. Emergence of SARS-CoV-2 through recombination and strong purifying selection.