Kew Tree of Life Explorer - Release notes
=========================================

The Kew Tree of Life Explorer allows users to explore evolutionary trees of life and to access the genomic data that underpin them. It is an output of the Plant and Fungal Trees of Life Project (PAFTOL) at the Royal Botanic Gardens, Kew (https://www.kew.org/science/our-science/projects/plant-and-fungal-trees-of-life), which aims to discover and disseminate the evolutionary history of all plant and fungal genera through phylogenetic approaches.

Tree of Life data are periodically released via the Kew secure file transfer protocol (SFTP) site (sftp.kew.org/pub/treeoflife) and are additionally made available for interactive web-based exploration at http://treeoflife.kew.org. Data releases are identified with a major and a minor version number, separated by a ‘.’. This document describes the contents of Kew Tree of Life Explorer Release 4.0. A description of the files provided in this release is available at sftp.kew.org/pub/paftfol/README.txt.

Scope and methodology
=====================

This version of the Kew Tree of Life data set comprises data from 64 orders, 417 families, 12,239 genera and 20,375 species, across 20,475 sequenced angiosperm samples, based on the Angiosperms353 gene set.

The methodology used to obtain gene sequence data and to infer the phylogenetic trees presented in the Tree of Life Explorer has been described by Baker et al. (2022; https://doi.org/10.1093/sysbio/syab035). In brief, various public tools are used to recover gene sequence data from a range of data types. The species tree displayed in the Kew Tree of Life Explorer is built in a two-step process. In the first step, gene trees are estimated and used to build a preliminary species tree, which is used for phylogenetic validation of specimen family identity.
This validation (along with DNA barcode validation) informs the final selection of samples for inclusion in the second, final step, in which gene trees and the species tree are rebuilt. The methodology of Baker et al. (2022) has been amended and updated for this data release (4.0) as described below.

Source nucleotide (DNA or RNA) data, in the form of raw sequence reads or assembled transcripts or genomes, have been downloaded from the European Nucleotide Archive (ENA) or other public repositories (e.g. CyVerse and FigShare), or have been generated de novo by PAFTOL or its collaborators and submitted to the INSDC. A list of accession numbers used is included with every release.

In this data release, we introduce for the first time samples with multiple sequencing runs. For these samples, the input data are the concatenation of several FASTQ files from independent sequencing runs, always from the same specimen. These are combined prior to downstream processing in order to improve gene recovery. Raw data from each sequencing run are submitted separately to ENA, with the concatenation of the respective accessions used as the sequence identifier (e.g. ERR16917359_ERR7621247 for Asteriscus intermedius). Information on the run accessions used for each sample is provided in the sequence manifest.

Gene recovery from raw reads (e.g. target sequence capture data, other raw read data) is performed using HybPiper2 (Johnson et al., 2016) and gene recovery from assembled sequences (e.g. annotated and unannotated genomes, assembled transcriptomes) is performed by Captus (Ortiz et al., 2023).

The software for estimating the gene alignments and trees has changed from that specified by Baker et al. (2022). Following the approach of Zuntini, Carruthers et al. (2024; https://doi.org/10.1038/s41586-024-07324-0), we first produce 'backbone' gene alignments and trees from a subset of the best quality samples available, distributed across all angiosperm families and including two samples per family, or proportionally more for larger families. This process mirrors the structure of the main workflow, with two steps of gene alignment and tree estimation and an intermediate validation step, but the software used differs as follows. To produce the backbone gene alignments, amino acid sequences are aligned with MAFFT (Katoh & Standley, 2013) and the corresponding DNA sequences are converted into a codon-based DNA alignment with PAL2NAL (Suyama et al., 2006). IQ-TREE 2 (Minh et al., 2020) is used to estimate each gene tree.

To build global alignments, sequences from each of the remaining samples in the dataset are aligned with EMMA (Shen et al., 2023) using the 'backbone' gene alignments and trees as guides. An extra filtering step on the gene alignments has been added, performed by TAPER (Zhang et al., 2021), to remove small erroneous stretches of sequence in the alignments.

From each global gene alignment, a gene tree is estimated using a divide-and-conquer approach. Initially, FastTree (Price et al., 2010) is used to create a starting guide tree, with the respective gene tree from the backbone as a topological constraint. Then, the gene tree is divided into evenly sized subsets with NJMerge (Molloy & Warnow, 2018) and, using the corresponding alignment subsets, a more thorough tree method (IQ-TREE 2; Minh et al., 2020) is used to build subset trees with the aim of improving the tree topology across each subset. Finally, Guide Tree Merger (Smirnov & Warnow, 2020) is used to combine the IQ-TREE 2 subset trees into a single global gene tree. Before the TreeShrink step (Mai & Mirarab, 2018), it is necessary to recompute the branch lengths on the gene trees using RAxML-NG (Kozlov et al., 2019).

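For users working with the sequence manifest, the identifier convention described above for samples with multiple sequencing runs (e.g. ERR16917359_ERR7621247) can be unpacked with a few lines of Python. This is a minimal sketch and not part of the release: the function name split_run_accessions and the regular expression are illustrative assumptions, and the pattern assumes run accessions of the form ERR, SRR or DRR followed by digits. Identifiers that do not fit that pattern (for example internal identifiers, or other accessions that happen to contain an underscore) are returned unchanged rather than split.

```python
import re

# Assumed shape of an INSDC run accession: ERR/SRR/DRR followed by digits.
RUN_ACCESSION = re.compile(r"[EDS]RR\d+")


def split_run_accessions(sequence_id: str) -> list[str]:
    """Return the run accessions encoded in a sequence identifier.

    A multi-run identifier such as 'ERR16917359_ERR7621247' is split on '_'.
    If any part does not look like a run accession, the identifier is
    treated as a single opaque ID and returned as a one-element list.
    """
    parts = sequence_id.split("_")
    if all(RUN_ACCESSION.fullmatch(part) for part in parts):
        return parts
    return [sequence_id]


if __name__ == "__main__":
    # Example identifier taken from the text above.
    print(split_run_accessions("ERR16917359_ERR7621247"))
    # An identifier that is not a run accession is left untouched.
    print(split_run_accessions("SFG00267"))
```

Checking every part against the pattern before splitting is a deliberate choice: it avoids breaking identifiers in which the underscore is part of the accession itself.
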
The species tree is estimated with weighted ASTRAL (Zhang & Mirarab, 2022) from all the gene trees and rooted with a set of gymnosperm species. The tree is annotated with local posterior probabilities as indicators of branch support. Because some support values are unavailable from weighted ASTRAL owing to a technical issue (they appear as '-nan' values), the support values have also been calculated with ASTRAL-IV (Zhang et al., 2025) using the same weighted ASTRAL tree topology. Support values visualised in the Kew Tree of Life Explorer tree viewer are derived from ASTRAL-IV. Both sets of support values are available from the SFTP links (see Data Access).

This release was produced with the software listed below:

AMAS - Borowiec, M.L. (2016) https://doi.org/10.7717/peerj.1660 https://github.com/marekborowiec/AMAS
ASTRAL4 v. 1.23.4.6 - Zhang, C., Nielsen, R. & Mirarab, S. (2025) https://doi.org/10.1093/molbev/msaf172 https://github.com/chaoszhang/ASTER/blob/master/tutorial/astral4.md
Captus v. 1.4.8 & 1.5.1 - Ortiz, E.M., Höwener, A., Shigita, G., Raza, M., Maurin, O., Zuntini, A., Forest, F., Baker, W.J. & Schaefer, H. (2023) https://doi.org/10.1101/2023.10.27.564367 https://github.com/edgardomortiz/captus
EMMA v. 0.1.0 - Shen, C., Liu, B., Williams, K.P. & Warnow, T. (2023) https://doi.org/10.1186/s13015-023-00247-x https://github.com/c5shen/EMMA
FastTree v. 2.1.11 - Price, M.N., Dehal, P.S. & Arkin, A.P. (2010) https://doi.org/10.1371/journal.pone.0009490 http://www.microbesonline.org/fasttree/
Guide Tree Merger - Smirnov, V. & Warnow, T. (2020) https://doi.org/10.1186/s12864-020-6605-1 https://github.com/vlasmirnov/GTM
HybPiper v. 2.2.0 - Johnson, M.G., Gardner, E.M., Liu, Y., Medina, R., Goffinet, B., Shaw, A.J., Zerega, N.J.C. & Wickett, N.J. (2016) https://doi.org/10.3732/apps.1600016 https://github.com/mossmatters/HybPiper
IQ-TREE 2 v. 2.4.0 - Minh, B.Q., Schmidt, H.A., Chernomor, O., Schrempf, D., Woodhams, M.D., Von Haeseler, A. & Lanfear, R. (2020) https://doi.org/10.1093/molbev/msaa015 https://github.com/iqtree/iqtree2
MAFFT v. 7.526 - Katoh, K. & Standley, D.M. (2013) https://doi.org/10.1093/molbev/mst010 https://mafft.cbrc.jp/alignment/software/
Newick Utilities v. 1.6.0 - Junier, T. & Zdobnov, E.M. (2010) https://doi.org/10.1093/bioinformatics/btq243 https://github.com/tjunier/newick_utils
NJMerge - Molloy, E.K. & Warnow, T. (2018) https://doi.org/10.13012/B2IDB-1424746_V1
PAL2NAL v. 14 - Suyama, M., Torrents, D. & Bork, P. (2006) https://doi.org/10.1093/nar/gkl315 https://github.com/liaochenlanruo/PAL2NAL
RAxML-NG v. 1.2.2 - Kozlov, A.M., Darriba, D., Flouri, T., Morel, B. & Stamatakis, A. (2019) https://github.com/amkozlov/raxml-ng
TAPER v. 1.0.2 - Zhang, C., Zhao, Y., Braun, E.L. & Mirarab, S. (2021) https://doi.org/10.1111/2041-210X.13696 https://github.com/chaoszhang/TAPER
TreeShrink v. 1.3.9 - Mai, U. & Mirarab, S. (2018) https://doi.org/10.1186/s12864-018-4620-2 https://github.com/uym2/TreeShrink
Weighted ASTRAL v. 1.23.3.7 III - Zhang, C. & Mirarab, S. (2022) https://doi.org/10.1093/molbev/msac215 https://github.com/chaoszhang/ASTER/blob/master/tutorial/wastral.md

An internal identifier was used for one sample for which a public accession was not available at the time that Release 4.0 was made public:

Data repository  Sequence ID  Sequence type       Species name   Project
Kew_internal     SFG00267     Unannotated genome  Bhesa robusta  Dataset: Singapore BTNR

Licensing
=========

Kew Tree of Life data (hereafter “the data”) are released under the Creative Commons Attribution 4.0 International (CC BY 4.0) license (https://creativecommons.org/licenses/by/4.0). To attribute the data, please follow our citation guidelines (below) and reference the appropriate data release number.

In many cases, the data have been released prior to publication in the academic literature, in accordance with the Toronto guidelines on pre-publication data sharing (https://www.nature.com/articles/461168a).
Users may freely analyse released pre-publication data, but should act responsibly by 1) respecting the scientific etiquette that allows data producers to publish the first global analyses of their data set, 2) accurately and completely citing the source of pre-publication data, and 3) contacting the data producers to discuss publication plans in the case of overlap between planned analyses. Please contact us (at the email address below) if you have any questions about what you may do with the data.

Citing us
=========

When using the Kew Tree of Life Explorer, please cite the following publication:

Baker et al. 2022. A Comprehensive Phylogenomic Platform for Exploring the Angiosperm Tree of Life. Systematic Biology 71: 301–319. https://doi.org/10.1093/sysbio/syab035.

Contact us
==========

Please contact treeoflife AT kew DOT org for support or advice.