README
======

The Kew Tree of Life Explorer allows users to explore evolutionary trees of life and to 
access the genomic data that underpin them. It is an output of the Plant and Fungal Trees 
of Life Project (PAFTOL) at the Royal Botanic Gardens, Kew 
(https://www.kew.org/science/our-science/projects/plant-and-fungal-trees-of-life), which 
aims to discover and disseminate the evolutionary history of all plant and fungal genera 
through phylogenetic approaches.  Tree of Life data are periodically released via the Kew 
secure file transfer protocol (SFTP) site (sftp.kew.org/pub/treeoflife) and are 
additionally made available for interactive web-based exploration at 
http://treeoflife.kew.org.

The Kew Tree of Life SFTP site
==============================

|-- README.txt  This document
|
|-- current_release  A link to the current release of Kew Tree of Life Explorer
|
|-- releases  A directory containing all previous releases of Kew Tree of Life data
     |
     |-- <release_number>  One directory for each Kew Tree of Life data release
           |
           |-- kew_tree_of_life_release_notes_<release_number>.txt  A document describing 
           |                                                        the contents of the 
           |                                                        release
           |
           |-- kew_tree_of_life_release_notes.txt  A symlink to the above file
           | 
           |-- sequence_manifest.txt  A document listing the accession numbers (in public
           |                          repositories) of all nucleotide sequence data used 
           |                          in the release
           |
           |-- deleted_sequences.txt  A document listing the accession numbers (in public
           |                          repositories) of all nucleotide sequence data used
           |                          in previous releases of the Kew Tree of Life that 
           |                          have not been used in this one
           |
           |-- specimen_manifest.txt  A document listing the scientific name of all
           |                          species included in this release, with additional
           |                          information about the specimens which have been
           |                          sampled
           |
           |
           |-- gene_manifest.txt  A document listing the genes included in this release
           |
           |
           |-- fasta  A directory containing gene sequences in FASTA format. Sequences are 
           |    |     generated from recovery processes, for a number of specified genes, 
           |    |     and according to a specified method
           |    |
           |    |-- alignments A directory containing alignment data for each gene, in 
           |    |              aligned FASTA format.
           |    |
           |    |-- by_gene  A directory containing files containing all assembled 
           |    |            sequences for a given gene
           |    |
           |    |-- by_recovery  A directory containing files containing all assembled 
           |                      sequences for a given recovery
           |
           |-- tree  A directory containing tree files in Newick format for genes and
           |    |    species
           |    |
           |    |-- gene  A directory containing tree files in Newick format for each
           |    |         gene used to build the species tree
           |    |
           |    |-- species  A directory containing the species tree file for this 
           |                 release in Newick format
           |                  
           |-- nex
           |    | 
           |    |-- species  A directory containing the species tree file for this 
           |                 release in NEXUS format.
           |               
           |-- svg
                |
                |-- species  A directory containing the species tree file for this 
                             release in SVG format.
           
File naming conventions
=======================

Files in fasta/alignments:
--------------------------

<gene_id>.aln.fasta

Files in fasta/by_gene:
-----------------------

<gene_id>.<molecule_type>.fasta

The Gene ID identifies the pan-species gene concept, and is taken from the Angiosperm 353 
data set (Johnson et al., https://doi.org/10.1093/sysbio/syy086).

Molecule types used in this release:

DNA

Protein files will be provided in future releases.

Files in fasta/by_recovery:
---------------------------

<repository_name>.<sequence_id>.<species_name>.<sequence_set>.fasta

A "recovery" is a bioinformatic analysis of a set of sequence data from a single species, 
yielding a set of gene sequences. All sequence sets used for recoveries are accessioned in
a public repository.

Repositories in use in this release:

INSDC: The ENA/GenBank/DDBJ International Nucleotide Sequence Database Collaboration 
(INSDC). 
oneKP: The data repository of the One Thousand Plant Transcriptomes Initiative, available
here:
https://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/oneKP_capstone_2019

The sequence_id is the identifier used for those sequences within the named repository.

Sequence sets in use in this release:

a353: the Angiosperms353 gene set 
(see Johnson et al., https://doi.org/10.1093/sysbio/syy086)

The files in this directory always contain DNA sequence.  It is not anticipated that 
protein sequence files will be made available on a per recovery basis.
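
The naming convention above can be split back into its parts programmatically.  The 
following is a minimal Python sketch, not part of the release tooling.  It assumes that 
only the sequence ID may itself contain dots; the function name and the example file name 
are hypothetical placeholders.

```python
from typing import NamedTuple


class RecoveryFile(NamedTuple):
    repository: str
    sequence_id: str
    species: str
    sequence_set: str


def parse_recovery_filename(filename: str) -> RecoveryFile:
    """Split <repository_name>.<sequence_id>.<species_name>.<sequence_set>.fasta."""
    parts = filename.split(".")
    if len(parts) < 5 or parts[-1] != "fasta":
        raise ValueError(f"unexpected file name: {filename!r}")
    # Assumption: only the sequence ID may contain dots, so the repository is
    # the first field and the two fields before the extension are the species
    # name and the sequence set.
    repository = parts[0]
    species, sequence_set = parts[-3], parts[-2]
    sequence_id = ".".join(parts[1:-3])
    return RecoveryFile(repository, sequence_id, species, sequence_set)


# Placeholder file name, for illustration only.
example = parse_recovery_filename("INSDC.ERR0000000.Genus_species.a353.fasta")
print(example)
```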

Files in tree/gene
------------------

<gene_id>.tree

A gene tree for each gene, built from the corresponding alignment in fasta/alignments, in 
Newick format.

Nodes are labelled as follows:

<Order>_<Family>_<Genus>_<Species>_<Sequence_ID>

Where the sequence ID is the identifier of the sequence derived from the sample as stored
in a sequence repository.  Further details are provided in the sequence manifest.
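
A node label can be taken apart with a short helper.  The Python sketch below assumes that 
the five fields are joined by single underscores and that the order, family, genus and 
species fields contain no underscores themselves; the function name and the example label 
are hypothetical.

```python
from typing import NamedTuple


class NodeLabel(NamedTuple):
    order: str
    family: str
    genus: str
    species: str
    sequence_id: str


def parse_node_label(label: str) -> NodeLabel:
    # Split on the first four underscores only, so that any underscore
    # inside the sequence ID is preserved.
    fields = label.split("_", 4)
    if len(fields) != 5:
        raise ValueError(f"unexpected node label: {label!r}")
    return NodeLabel(*fields)


# Placeholder label, for illustration only.
label = parse_node_label("Orderales_Familyaceae_Genus_species_ERR0000000")
print(label.family, label.sequence_id)
```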

Files in tree/species
---------------------

treeoflife.<release_id>.tree
treeoflife.current.tree

A file containing the Kew "tree of life" for all species included in this release in 
Newick format.  The file name contains the release ID; a symlink to the current tree is 
provided with every release for convenient download.

treeoflife.all_support_values.<release_id>.tree
treeoflife.all_support_values.current.tree

A file containing the Kew "tree of life" for all species included in this release in 
Newick format, with the inclusion of all support value data for each node defined as 
follows:

q1:  quartet support for the main topology
q2:  quartet support for the first alternative topology
q3:  quartet support for the second alternative topology
f1:  number of quartet trees in all the gene trees that support the main topology
f2:  number of quartet trees in all the gene trees that support the first alternative 
     topology
f3:  number of quartet trees in all the gene trees that support the second alternative 
     topology
pp1: local posterior probability for the main topology
pp2: local posterior probability for the first alternative topology
pp3: local posterior probability for the second alternative topology
QC:  number of quartets defined around each branch
EN:  effective number of genes for the branch
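
This README does not specify how the support values are written into the Newick file.  The 
Python sketch below assumes that each node carries an annotation of the form 
[key=value;key=value;...], optionally wrapped in single quotes; check that assumption 
against the actual file before relying on it.  The function name and the numbers are 
hypothetical.

```python
from typing import Dict


def parse_support(annotation: str) -> Dict[str, float]:
    # Assumed layout: '[q1=...;q2=...;...]' with optional quotes and brackets.
    body = annotation.strip().strip("'").strip("[]")
    values = {}
    for item in body.split(";"):
        if not item:
            continue
        key, _, value = item.partition("=")
        values[key.strip()] = float(value)
    return values


# Made-up numbers, for illustration only.
support = parse_support("'[q1=0.6;q2=0.3;q3=0.1;pp1=0.98;QC=500;EN=120.0]'")
best_topology = max(("q1", "q2", "q3"), key=support.get)
print(best_topology, support["pp1"])
```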

Nodes are labelled (in both trees) as follows:

<Order>_<Family>_<Genus>_<Species>_<Sequence_ID>

Where the sequence ID is the identifier of the sequence derived from the sample as stored
in a sequence repository.  Further details are provided in the sequence manifest.

Files in nex
------------

treeoflife.<release_id>.nex
treeoflife.current.nex

A file containing the Kew "tree of life" for all species included in this release in 
NEXUS format.  The file name contains the release ID; a symlink to the current tree is 
provided with every release for convenient download.

Files in svg
------------

treeoflife.<release_id>.svg
treeoflife.current.svg

A file containing the Kew "tree of life" for all species included in this release in 
Scalable Vector Graphics (SVG) format.  The file name contains the release ID; a 
symlink to the current tree is provided with every release for convenient download.

FASTA headers
=============

Sequences in FASTA files have headers as follows:

><gene_id> Gene_Name:<gene_name> Species:<species_name> Repository:<repository_name> 
Sequence_ID:<sequence_id>

The Gene ID identifies the pan-species gene concept, and is taken from the Angiosperm 353 
data set (Johnson et al., https://doi.org/10.1093/sysbio/syy086).

The gene name is an exemplar gene name for the gene that has been recovered (i.e. in use 
for this gene in at least one of the species from which the gene has been recovered).  It 
is not necessarily the name by which the gene is known in the recovered species.  All 
instances of this gene are assigned the same name in a single release.  The gene name is 
not guaranteed to be stable between releases. To identify the same gene in successive 
releases, use the Gene ID. If no suitably named exemplar gene has been found the gene 
name is given as "NA".

The species name comprises genus and species names in accordance with scientific 
convention and uses underscores in place of spaces.

The sequence repository and sequence identifier are defined as in the names of the files
in the fasta/by_recovery directory (see above).
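
The header layout described in this section can be read with a regular expression.  The 
Python sketch below assumes that the header occupies a single line in the file and that 
only the gene name may contain spaces; the function name and the example values are 
hypothetical.

```python
import re
from typing import Dict

HEADER_RE = re.compile(
    r"^>(?P<gene_id>\S+) Gene_Name:(?P<gene_name>.*?) "
    r"Species:(?P<species>\S+) Repository:(?P<repository>\S+) "
    r"Sequence_ID:(?P<sequence_id>\S+)$"
)


def parse_header(line: str) -> Dict[str, str]:
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected FASTA header: {line!r}")
    fields = match.groupdict()
    # Species names use underscores in place of spaces; undo that for display.
    fields["species"] = fields["species"].replace("_", " ")
    return fields


# Placeholder header, for illustration only.
header = parse_header(
    ">0000 Gene_Name:NA Species:Genus_species Repository:INSDC Sequence_ID:ERR0000000"
)
print(header["gene_id"], header["species"])
```

Because the gene name is not guaranteed to be stable between releases, the gene_id field 
is the safer key when comparing sequences across releases.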

Deleted sequences
=================

This is a tab-delimited file, with columns as follows:

1. Repository name
2. Sequence identifier
3. Sequence type. One of genome, transcript, read.
4. Scientific species name
5. Release ID where first included
6. Release ID from which sequence was deleted
7. Reason for deletion

The values of "Reason for deletion" currently are: 

Failed_barcoding: the taxonomic identity of the sequence generated was inconsistent with 
                  the sequence obtained at known barcoding loci

Alternative_specimen: sequence from an alternative specimen has been chosen to represent 
                    this species

Sequence manifest
=================

This is a tab-delimited file, with columns as follows:

1. Repository name
2. Sequence identifier
3. Sequence type. One of genome, transcript, read
4. Scientific species name
5. Project name.  One of PAFTOL, oneKP. A '-' is used when the sequence has not been 
   generated by a known phylogenetic project

The values of 'Repository name' currently in use are:

INSDC: The ENA/GenBank/DDBJ International Nucleotide Sequence Database Collaboration 
(INSDC) 

Deleted sequences
=================

This is a tab-delimited file, with columns as follows:

1. Repository name
2. Sequence identifier
3. Sequence type. One of genome, transcript, read
4. Scientific species name
5. Project name
6. Release first included
7. Release deleted
8. Reason for deletion.  One of 'Duplicated_sample', 'Failed_family_identification'

Specimen manifest
=================

This is a tab-delimited file, with columns as follows:

1. Scientific species name
2. Collection ID (of the specimen used); from Index Herbariorum
3. Specimen ID or barcode
4. Voucher information
5. Specimen URL (to an online catalogue entry for that specimen, where available)

The values of 'Collection ID' currently in use are:

ADU:   University of Adelaide (Australia, South Australia, Adelaide)
AK:    Auckland War Memorial Museum (New Zealand, Auckland)
ATH:   Goulandris Natural History Museum (Greece, Athens)
B:     Botanischer Garten und Botanisches Museum Berlin, Zentraleinrichtung der Freien 
       Universitaet Berlin (Germany, Berlin)
BA:    Museo Argentino de Ciencias Naturales "Bernardino Rivadavia" (Argentina, Buenos 
       Aires)
BC:    Institut Botanic de Barcelona (Spain, Barcelona)
BCN:   University of Barcelona (Spain, Barcelona)
BG:    University of Bergen (Norway, Bergen)
BH:    Cornell University (U.S.A., New York, Ithaca)
BHCB:  Universidade Federal de Minas Gerais (Brazil, Minas Gerais, Belo Horizonte)
BHO:   Ohio University (U.S.A., Ohio, Athens)
BISH:  Bishop Museum (U.S.A., Hawaii, Honolulu)
BJFC:  Beijing Forestry University (People's Republic of China, Beijing)
BM:    The Natural History Museum (U.K., England, London)
BNRH:  Buffelskloof Nature Reserve (South Africa, Mpumalanga Province, Lydenburg)
BO:    Research Centre for Biology (Indonesia, Cibinong)
BR:    Meise Botanic Garden (Belgium, Meise)
BRI:   Queensland Herbarium (Australia, Queensland, Brisbane)
BRIT:  Botanical Research Institute of Texas (U.S.A., Texas, Fort Worth)
BRLU:  Universite Libre de Bruxelles (Belgium, Bruxelles)
BRUN:  Brunei Forestry Centre (Brunei Darussalam, Belait)
CAS:   California Academy of Sciences (U.S.A., California, San Francisco)
CAY:   Institut de Recherche pour le Developpement (IRD) (French Guiana, Cayenne)
CEN:   Embrapa Recursos Genéticos e Biotecnologia - Embrapa Cenargen (Brazil, Distrito Federal, Brasília)
CNS:   Australian Tropical Herbarium (Australia, Queensland, Smithfield)
COL:   Universidad Nacional de Colombia (Colombia, D.C., Bogota)
CORD:  Herbario CORD (Argentina, Cordoba, Cordoba)
CS:    Colorado State University (U.S.A., Colorado, Fort Collins)
E:     Royal Botanic Garden Edinburgh (U.K., Scotland, Edinburgh)
EA:    National Museums of Kenya (Kenya, Nairobi)
ESA:   Universidade de São Paulo (Brazil, São Paulo, Piracicaba)
FHI:   Forestry Research Institute of Nigeria (Nigeria, Oyo, Ibadan)
FLAS:  Florida Museum of Natural History (U.S.A., Florida, Gainesville)
FRI:   Commonwealth Scientific and Industrial Research Organization (CSIRO) (Australia, 
       Australian Capital Territory, Canberra)
FTG:   Fairchild Tropical Botanic Garden (U.S.A., Florida, Miami)
G:     Conservatoire et Jardin botaniques de la Ville de Genève (Switzerland, Genève)
GC:    University of Ghana (Ghana, Legon)
GENT:  Ghent University (Belgium, Ghent)
GUAY:  Universidad de Guayaquil (Ecuador, Guayas, Guayaquil)
HAW:   University of Hawaii (U.S.A., Hawaii, Honolulu)
HBG:   University of Hamburg (Germany, Hamburg)
HITBC: Xishuangbanna Tropical Botanical Garden, Academia Sinica (People's Republic of 
       China, Yunnan, Xishuangbanna)
HNG:   Université Gamal Abdel Nasser de Conakry (UGANC) (Republic of Guinea, Conakry)
HRCB:  Universidade Estadual Paulista (Brazil, São Paulo, Rio Claro)
HTW:   Universidad Nacional de la Patagonia San Juan Bosco - Sede Trelew (Argentina, 
       Chubut, Trelew)
HUA:   Universidad de Antioquia (Colombia, Antioquia, Medellín)
IBUG:  Universidad de Guadalajara (Mexico, Jalisco, Zapopan)
INB:   Instituto Nacional de Biodiversidad (Costa Rica, Santo Domingo)
INPA:  Instituto Nacional de Pesquisas da Amazônia (Brazil, Amazonas, Manaus)
JBL:   Jardín Botánico Lankester, Universidad de Costa Rica (Costa Rica, Cartago)
JRAU:  University of Johannesburg (South Africa, Gauteng Province, Johannesburg)
K:     Royal Botanic Gardens, Kew (U.K., Kew)
KKU:   Khon Kaen University (Thailand, Khon Kaen)
KPBG:  Kings Park and Botanic Garden (Australia, Western Australia, Perth)
KUN:   Kunming Institute of Botany, Chinese Academy of Sciences (People's Republic of 
       China, Yunnan, Kunming)
L:     Naturalis (Netherlands, Leiden)
LP:    Museo de La Plata (Argentina, Buenos Aires, La Plata)
LUH:   University of Lagos (Nigeria, Lagos, Lagos) 
M:     Botanische Staatssammlung München (Germany, München)
MAN:   Universitas Papua (Indonesia, Manokwari)
MAU:   The Mauritius Herbarium (Mauritius, Reduit)
MBML:  Instituto Nacional da Mata Atlântica - INMA (Brazil, Espírito Santo, Santa Teresa)
MEXU:  Universidad Nacional Autónoma de México (Mexico, Mexico City, Mexico City)
MJG:   Johannes Gutenberg-Universität (Germany, Mainz)
MO:    Missouri Botanical Garden (U.S.A., Missouri, Saint Louis)
MPU:   Université de Montpellier (France, Montpellier)
MT:    Université de Montréal (Canada, Québec, Montréal)
NBG:   South African National Biodiversity Institute (South Africa, Western Cape Province, 
       Cape Town)
NCU:   University of North Carolina at Chapel Hill (U.S.A., North Carolina, Chapel Hill)
NE:    University of New England (Australia, New South Wales, Armidale)
NH:    South African National Biodiversity Institute (South Africa, KwaZulu-Natal 
       Province, Durban)
NOU:   Institut de Recherche pour le Développement (IRD) (New Caledonia, Noumea)
NSW:   Royal Botanic Gardens & Domain Trust (Australia, New South Wales, Sydney)
NY:    The New York Botanical Garden (U.S.A., New York, Bronx)
P:     Museum National d'Histoire Naturelle (France, Paris)
PERTH: Western Australian Herbarium (Australia, Western Australia, Perth)
PMA:   Universidad de Panamá (Panama, Panamá, Panamá)
PRE:   South African National Biodiversity Institute (South Africa, Gauteng Province,
       Pretoria)
RB:    Jardim Botânico do Rio de Janeiro (Brazil, Rio de Janeiro, Rio de Janeiro)
REU:   Universite de la Reunion (Reunion, Sainte-Clotilde)
SING:  Singapore Botanic Gardens (Singapore, Singapore, Singapore)
SP:    Instituto de Botânica (Brazil, São Paulo, São Paulo)
SYD:   University of Sydney (Australia, New South Wales, Sydney)
TEX:   University of Texas at Austin (U.S.A., Texas, Austin)
TNS:   National Museum of Nature and Science (Japan, Tsukuba)
TUM:   Technische Universität München (Germany, Freising)
U:     Naturalis (Netherlands, Leiden)
UB:    Universidade de Brasília (Brazil, Distrito Federal, Brasília)
UPCB:  Universidade Federal do Paraná (Brazil, Paraná, Curitiba)
UPR:   Botanical Garden of the University of Puerto Rico (Puerto Rico, Puerto Rico, Río Piedras)
UPS:   Museum of Evolution (Sweden, Uppsala)
UPTC:  Universidad Pedagógica y Tecnológica de Colombia (Colombia, Boyacá, Tunja)
US:    Smithsonian Institution (U.S.A., District of Columbia, Washington)
USJ:   Universidad de Costa Rica (Costa Rica, San José, San Pedro de Montes de Oca)
USM:   Universidad Nacional Mayor de San Marcos (Peru, Lima)
WAG:   Naturalis (Netherlands, Leiden)
Z:     Universität Zürich (Switzerland, Zürich)
ZSS:   Sukkulenten-Sammlung Zürich (Switzerland, Zürich)      
      
Where no information is available, a column contains the text '-'. 

Gene manifest
=============

This is a tab-delimited file, with columns as follows:

1. Gene ID
2. Exemplar gene name
3. Species from which the exemplar gene name has been taken
4. Database name (of the database from which the exemplar gene name was obtained)
5. Record ID (of the database record from which the exemplar gene name was obtained)
6. URL (to the online database record from which the exemplar gene name was obtained)
7. In tree? (values 'Y' or 'N') - indicates whether this gene was used to build the 
   species tree or not.  

The databases from which exemplar Gene names are taken are currently:

UniProtKB: The UniProt Knowledgebase (http://www.uniprot.org)

If no suitably named exemplar gene has been found, columns 2, 3 and 4 contain the text 
"-".
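
The gene manifest can be loaded with Python's standard csv module.  The sketch below 
assumes that the file has no header row, that every data row has exactly seven 
tab-separated columns, and that "-" marks a missing value; the function name and the 
sample rows are hypothetical.

```python
import csv
import io
from typing import Iterable, List, NamedTuple, Optional


class GeneRecord(NamedTuple):
    gene_id: str
    exemplar_name: Optional[str]
    species: Optional[str]
    database: Optional[str]
    record_id: Optional[str]
    url: Optional[str]
    in_tree: bool


def read_gene_manifest(lines: Iterable[str]) -> List[GeneRecord]:
    records = []
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 7:
            continue  # skip blank or malformed lines
        gene_id, *middle, in_tree = row
        # "-" marks a missing value; represent it as None.
        cleaned = [None if value == "-" else value for value in middle]
        records.append(GeneRecord(gene_id, *cleaned, in_tree.strip() == "Y"))
    return records


# Made-up rows, for illustration only.
sample = (
    "g1\tNAME1\tGenus species\tUniProtKB\tP00000\thttp://www.uniprot.org\tY\n"
    "g2\t-\t-\t-\t-\t-\tN\n"
)
records = read_gene_manifest(io.StringIO(sample))
in_tree_ids = [record.gene_id for record in records if record.in_tree]
print(in_tree_ids)
```

To read a real manifest, pass an open file handle instead of the io.StringIO sample.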