Kew Tree of Life Explorer - Release notes
=========================================

The Kew Tree of Life Explorer allows users to explore evolutionary trees of
life and to access the genomic data that underpin them. It is an output of the
Plant and Fungal Trees of Life Project (PAFTOL) at the Royal Botanic Gardens, Kew
(https://www.kew.org/science/our-science/projects/plant-and-fungal-trees-of-life),
which aims to discover and disseminate the evolutionary history of all plant
and fungal genera through phylogenetic approaches. Tree of Life data are
periodically released via the Kew secure file transfer protocol (SFTP) site
(sftp.kew.org/pub/treeoflife) and are additionally made available for
interactive web-based exploration at http://treeoflife.kew.org.

Data releases are identified by a major and a minor version number, separated
by a ‘.’. This document describes the contents of Kew Tree of Life Explorer
Release 4.0. A description of the files provided in this release is available
at sftp.kew.org/pub/paftfol/README.txt.


Scope and methodology
=====================

This version of the Kew Tree of Life data set comprises data from 64 orders,
417 families, 12,239 genera and 20,375 species, across 20,475 sequenced
angiosperm samples, based on the Angiosperms353 gene set.

The methodology used to obtain gene sequence data and to infer phylogenetic
trees presented in the Tree of Life Explorer has been described by Baker et al.
(2022; https://doi.org/10.1093/sysbio/syab035). In brief, various public tools
are used to recover gene sequence data from a range of data types. The species
tree displayed in the Kew Tree of Life Explorer is built in a two-step process.
In the first step, gene trees are estimated and used to build a preliminary
species tree, which is used for phylogenetic validation of specimen family
identity. This validation (along with DNA barcode validation) informs the final
selection of samples for inclusion in the second, final step in which gene
trees and the species tree are rebuilt.

The methodology of Baker et al. (2022) has been amended and updated for this
data release (4.0) as described below.

Source nucleotide (DNA or RNA) data, in the form of raw sequence reads or
assembled transcripts or genomes, have been downloaded from the European
Nucleotide Archive (ENA) or other public repositories (e.g. CyVerse and
FigShare), or have been generated de novo by PAFTOL or its collaborators and
submitted to the INSDC. A list of accession numbers used is included with every
release.

In this data release, we introduce for the first time samples with multiple
sequencing runs. For these samples, the input data are the concatenation of
several FASTQ files from independent sequencing runs, always from the same
specimen. These are combined prior to downstream processing in order to improve
gene recovery. Raw data from each sequencing run are submitted separately to
ENA, with the concatenation of the respective accessions used as the sequence
identifier (e.g. ERR16917359_ERR7621247 for Asteriscus intermedius).
Information on the run accessions used for each sample is provided in the
sequence manifest.
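
As an illustration of the identifier convention described above, the following
Python sketch splits a concatenated sequence identifier into its constituent
run accessions. The function name is hypothetical, and the sketch assumes that
the identifier consists only of run accessions joined by underscores; it is not
intended for other identifier types (for example, assembly accessions that
themselves contain an underscore). The sequence manifest remains the
authoritative source for the runs used.

```python
def split_run_accessions(sequence_id: str) -> list[str]:
    """Split a concatenated sequence identifier into run accessions.

    Assumption: the identifier is made up solely of run accessions joined
    with '_', and no individual accession contains an underscore.
    """
    return [part for part in sequence_id.split("_") if part]


# Example identifier taken from the release notes.
print(split_run_accessions("ERR16917359_ERR7621247"))
```

A sample with a single sequencing run yields a one-element list.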

Gene recovery from raw reads (e.g. target sequence capture data, other raw read
data) is performed using HybPiper 2 (Johnson et al., 2016), and gene recovery
from assembled sequences (e.g. annotated and unannotated genomes, assembled
transcriptomes) is performed using Captus (Ortiz et al., 2023).

The software for estimating the gene alignments and trees has changed from that
specified by Baker et al. (2022). Following the approach of Zuntini, Carruthers
et al. (2024; https://doi.org/10.1038/s41586-024-07324-0), we first produce
'backbone' gene alignments and trees from a subset of the best quality samples
available distributed across all angiosperm families, including two samples per
family or proportionally more for larger families. This process mirrors the
structure of the main workflow with two steps of gene alignment and tree
estimation, with an intermediate validation step, but the software used differs
as follows. To produce the backbone gene alignments, amino acid sequences are
aligned with MAFFT (Katoh & Standley, 2013) and the corresponding DNA sequences
are converted into a codon-based DNA alignment with PAL2NAL (Suyama et al.,
2006). IQ-TREE 2 (Minh et al., 2020) is used to estimate each gene tree.
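
The idea behind a codon-based DNA alignment can be sketched in a few lines of
Python: the gaps of an aligned amino acid sequence are projected onto the
corresponding coding DNA, three nucleotides per residue. This is a simplified
illustration only, not PAL2NAL's implementation; the function name is
hypothetical, and the sketch ignores complications such as stop codons,
frameshifts and mismatches between the protein and DNA sequences.

```python
def thread_codons(aligned_protein: str, dna: str) -> str:
    """Project the gaps of an aligned protein sequence onto its coding DNA.

    Each residue consumes one codon (three nucleotides); each gap
    character '-' becomes a three-nucleotide gap '---'.
    """
    residues = aligned_protein.replace("-", "")
    if len(dna) != 3 * len(residues):
        raise ValueError("DNA length must be three times the residue count")
    codons = []
    pos = 0  # current position in the unaligned DNA sequence
    for char in aligned_protein:
        if char == "-":
            codons.append("---")
        else:
            codons.append(dna[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Toy example (invented sequences): one gap between two residues.
print(thread_codons("M-K", "ATGAAA"))  # ATG---AAA
```

Because gaps are inserted only in multiples of three, the reading frame of
every sequence is preserved in the resulting DNA alignment.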

To build global alignments, sequences from each of the remaining samples in the
dataset are aligned with EMMA (Shen et al., 2023) using the ‘backbone’ gene
alignments and trees as guides. An extra filtering step on the gene alignments
has been added, performed by TAPER (Zhang et al., 2021), to remove small
erroneous stretches of sequence in the alignments.

From each global gene alignment, a gene tree is estimated using a
divide-and-conquer approach. Initially, FastTree (Price et al., 2010) is used
to create a starting guide tree, with the respective gene tree from the
backbone as a topological constraint. Then, the gene tree is divided into evenly
sized subsets with NJMerge (Molloy & Warnow, 2018), and, using the corresponding
alignment subsets, a more thorough tree method (IQ-TREE 2; Minh et al., 2020)
is used to build subset trees with the aim of improving the tree topology
across each subset. Finally, Guide Tree Merger (Smirnov & Warnow, 2020) is
used to combine the IQ-TREE 2 subset trees into a single global gene
tree. Before the TreeShrink (Mai & Mirarab, 2018) step, it is necessary to
recompute the branch lengths on the gene trees using RAxML-NG (Kozlov et al.,
2019).

The species tree is estimated with weighted ASTRAL (Zhang & Mirarab, 2022) from
all the gene trees and rooted with a set of gymnosperm species. The tree is
annotated with local posterior probabilities as indicators of branch support.
Due to a technical issue with some support values being unavailable (appearing
as '-nan' values) from weighted ASTRAL, the support values have also been
calculated with ASTRAL-IV (Zhang et al., 2025) using the same weighted ASTRAL
tree topology. Support values visualised in the Kew Tree of Life Explorer tree
viewer are derived from ASTRAL-IV. Both sets of support values are available
from the sftp links (see Data Access).
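
Users working with the weighted ASTRAL support values may wish to check how
many branches are affected. The Python sketch below counts internal nodes in a
Newick string whose support label is 'nan' or '-nan'. It assumes that support
values are written as node labels immediately after a closing parenthesis; the
function name and the toy tree are invented for illustration and do not
represent the released files.

```python
import re

# Matches an internal-node label of "nan" or "-nan" directly after a ")",
# followed by a branch length separator or the end of the clade.
_NAN_SUPPORT = re.compile(r"\)\s*-?nan(?=[:,;)])", re.IGNORECASE)


def count_nan_supports(newick: str) -> int:
    """Count internal nodes whose support label is 'nan' or '-nan'."""
    return len(_NAN_SUPPORT.findall(newick))


toy_tree = "((A,B)0.98:1.0,(C,D)-nan:0.5);"  # invented example tree
print(count_nan_supports(toy_tree))
```

Taxon names are not matched, because only labels that directly follow a
closing parenthesis are considered.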

This release was produced with the software listed below:

AMAS - Borowiec, M.L. (2016) https://doi.org/10.7717/peerj.1660
https://github.com/marekborowiec/AMAS

ASTRAL4 v. 1.23.4.6 - Zhang, C., Nielsen, R. & Mirarab, S. (2025)
https://doi.org/10.1093/molbev/msaf172
https://github.com/chaoszhang/ASTER/blob/master/tutorial/astral4.md

Captus v. 1.4.8 & 1.5.1 - Ortiz, E.M., Höwener, A., Shigita, G., Raza, M.,
Maurin, O., Zuntini, A., Forest, F., Baker, W.J. & Schaefer, H. (2023)
https://doi.org/10.1101/2023.10.27.564367
https://github.com/edgardomortiz/captus

EMMA v. 0.1.0 - Shen, C., Liu, B., Williams, K.P. & Warnow, T. (2023)
https://doi.org/10.1186/s13015-023-00247-x
https://github.com/c5shen/EMMA

FastTree v. 2.1.11 - Price, M.N., Dehal, P.S. & Arkin, A.P. (2010)
https://doi.org/10.1371/journal.pone.0009490
http://www.microbesonline.org/fasttree/

Guide Tree Merger - Smirnov, V. & Warnow, T. (2020)
https://doi.org/10.1186/s12864-020-6605-1 https://github.com/vlasmirnov/GTM

HybPiper v. 2.2.0 - Johnson, M.G., Gardner, E.M., Liu, Y., Medina, R.,
Goffinet, B., Shaw, A.J., Zerega, N.J.C. & Wickett, N.J. (2016)
https://doi.org/10.3732/apps.1600016 https://github.com/mossmatters/HybPiper

IQ-TREE 2 v. 2.4.0 - Minh, B.Q., Schmidt, H.A., Chernomor, O., Schrempf, D.,
Woodhams, M.D., von Haeseler, A. & Lanfear, R. (2020)
https://doi.org/10.1093/molbev/msaa015 https://github.com/iqtree/iqtree2

MAFFT v. 7.526 - Katoh, K. & Standley, D.M. (2013)
https://doi.org/10.1093/molbev/mst010 https://mafft.cbrc.jp/alignment/software/

Newick Utilities v. 1.6.0 - Junier, T. & Zdobnov, E.M. (2010)
https://doi.org/10.1093/bioinformatics/btq243
https://github.com/tjunier/newick_utils

NJMerge - Molloy, E.K. & Warnow, T. (2018)
https://doi.org/10.13012/B2IDB-1424746_V1

PAL2NAL v. 14 - Suyama, M., Torrents, D. & Bork, P. (2006)
https://doi.org/10.1093/nar/gkl315 https://github.com/liaochenlanruo/PAL2NAL

RAxML-NG v. 1.2.2 - Kozlov, A.M., Darriba, D., Flouri, T., Morel, B. &
Stamatakis, A. (2019) https://github.com/amkozlov/raxml-ng

TAPER v. 1.0.2 - Zhang, C., Zhao, Y., Braun, E.L. & Mirarab, S. (2021)
https://doi.org/10.1111/2041-210X.13696 https://github.com/chaoszhang/TAPER

TreeShrink v. 1.3.9 - Mai, U. & Mirarab, S. (2018)
https://doi.org/10.1186/s12864-018-4620-2 https://github.com/uym2/TreeShrink

Weighted ASTRAL v. 1.23.3.7 III - Zhang, C. & Mirarab, S. (2022)
https://doi.org/10.1093/molbev/msac215
https://github.com/chaoszhang/ASTER/blob/master/tutorial/wastral.md


An internal identifier was used for one sample for which a public accession was
not available at the time that Release 4.0 was made public:

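===============  ===========  ==================  =============  =======================
Data repository  Sequence ID  Sequence type       Species name   Project
===============  ===========  ==================  =============  =======================
Kew_internal     SFG00267     Unannotated genome  Bhesa robusta  Dataset: Singapore BTNR
===============  ===========  ==================  =============  =======================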


Licensing
=========

Kew Tree of Life data (hereafter “the data”) are released under the Creative
Commons Attribution 4.0 International (CC BY 4.0) license
(https://creativecommons.org/licenses/by/4.0). To attribute the data, please
follow our citation guidelines (below) and reference the appropriate data
release number.

In many cases, the data have been released prior to publication in the academic
literature, in accordance with the Toronto guidelines on pre-publication data
sharing (https://www.nature.com/articles/461168a). Users may freely analyse
released prepublication data, but should act responsibly by 1) respecting the
scientific etiquette that allows data producers to publish the first global
analyses of their data set, 2) accurately and completely citing the source of
prepublication data, and 3) contacting the data producers to discuss
publication plans in the case of overlap between planned analyses. Please
contact us (at the email address below) if you have any questions about what
you may do with the data.
 
Citing us
=========

When using the Kew Tree of Life Explorer, please cite the following
publication:

Baker et al. 2022. A Comprehensive Phylogenomic Platform for Exploring the
Angiosperm Tree of Life. Systematic Biology 71: 301–319.
https://doi.org/10.1093/sysbio/syab035.

Contact us
==========

Please contact treeoflife AT kew DOT org for support or advice.