Index of /pub/paftol

Name                    Last modified      Size  Description
Parent Directory                             -
README.txt              2022-01-31 10:57   23K
releases/               2023-11-30 11:45     -
deprecated_releases/    2023-05-25 11:35     -
current_release/        2023-04-18 18:07     -

README
======

The Kew Tree of Life Explorer allows users to explore evolutionary trees of life and to 
access the genomic data that underpin them. It is an output of the Plant and Fungal Trees 
of Life Project (PAFTOL) at the Royal Botanic Gardens, Kew 
(https://www.kew.org/science/our-science/projects/plant-and-fungal-trees-of-life), which 
aims to discover and disseminate the evolutionary history of all plant and fungal genera 
through phylogenetic approaches.  Tree of Life data are periodically released via the Kew 
secure file transfer protocol (SFTP) site (sftp.kew.org/pub/treeoflife) and are 
additionally made available for interactive web-based exploration at 
http://treeoflife.kew.org.

This document contains the following sections:

1. The Kew Tree of Life SFTP site  
   Overview of directory structure and files contained within each directory
2. File naming conventions
3. FASTA headers
4. Manifests
   Description of file formats for sequence_manifest.txt, deleted_sequences.txt,
   specimen_manifest.txt, revised_specimen_nomenclature.txt, gene_manifest.txt

1. The Kew Tree of Life SFTP site
=================================

|-- README.txt  This document
|
|-- current_release  A link to the current release of Kew Tree of Life Explorer
|
|-- releases  A directory containing all previous releases of Kew Tree of Life data
     |
     |-- <release number>  One directory for each Kew Tree of Life data release
     |
     |-- kew_tree_of_life_release_notes_<release_number>.txt  A document describing the
     |                                                        contents of the release
     |
     |-- kew_tree_of_life_release_notes.txt  A symlink to the above file
     | 
     |-- sequence_manifest.txt  A document listing the accession numbers (in public
     |                          repositories) of all nucleotide sequence data used in the
     |                          release
     |
     |-- deleted_sequences.txt  A document listing the accession numbers (in public
     |                          repositories) of all nucleotide sequence data used in 
     |                          previous releases of the Kew Tree of Life that have not 
     |                          been used in this one
     |
     |-- specimen_manifest.txt  A document listing the scientific name of all
     |                          species included in this release, with additional
     |                          information about the specimens which have been
     |                          sampled
     |
     |--  revised_specimen_nomenclature.txt  A document identifying changes in
     |                                      specimen nomenclature between 
     |                                      successive releases
     |
     |-- gene_manifest.txt  A document listing the genes included in this release
     |
     |
     |-- fasta  A directory containing gene sequences in FASTA format. Sequences are
     |    |     generated from recovery processes, for a number of specified genes, and 
     |    |     according to a specified method
     |    |
     |    |-- alignments A directory containing alignment data for each gene, in aligned 
     |    |   FASTA format.
     |    |
     |    |-- by_gene  A directory containing files containing all assembled sequences for
     |    |            a given gene
     |    |
     |    |-- by_recovery  A directory containing files containing all assembled sequences
     |                     for a given recovery
     |
     |-- tree  A directory containing tree files in Newick format for genes and species
     |    |
     |    |-- gene  A directory containing tree files in Newick format for each gene used 
     |    |         to build the species tree
     |    |
     |    |-- species  A directory containing the species tree file for this release in 
     |                   Newick format
     |                  
     |-- nex
     |    | 
     |    |-- species  A directory containing the species tree file for this release in 
     |                 NEXUS format
     |               
     |-- svg
          |
          |-- species  A directory containing the species tree file for this release in 
                       SVG format
           
2. File naming conventions
==========================

Files in fasta/alignments
-------------------------

<gene_id>.<molecule_type>.aln.fasta

An alignment is built for each gene from the corresponding sequences in fasta/by_gene. 
Sequences with poor coverage of the alignment have been removed as described in Baker et 
al. (https://doi.org/10.1093/sysbio/syab035).

Files in fasta/by_gene
----------------------

<gene_id>.<molecule_type>.fasta

The Gene ID identifies the pan-species gene concept, and is taken from the Angiosperms353 
data set (Johnson et al., https://doi.org/10.1093/sysbio/syy086).

Molecule types used in this release:

DNA

Protein files may be provided in future releases.
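
As a minimal sketch of the two naming patterns above (fasta/alignments and 
fasta/by_gene), the following Python function splits a file name into its parts. The 
function name and the example gene ID are illustrative assumptions and are not part of 
the release; the sketch also assumes that neither the gene ID nor the molecule type 
contains a dot.

```python
import re

# Matches <gene_id>.<molecule_type>.fasta and <gene_id>.<molecule_type>.aln.fasta.
# Assumption: neither the gene ID nor the molecule type contains a dot.
_GENE_FILE = re.compile(
    r"^(?P<gene_id>[^.]+)\.(?P<molecule_type>[^.]+)(?P<aln>\.aln)?\.fasta$"
)


def parse_gene_filename(name):
    """Split a by_gene or alignments file name into its components."""
    match = _GENE_FILE.match(name)
    if match is None:
        raise ValueError(f"unrecognised file name: {name!r}")
    return {
        "gene_id": match["gene_id"],
        "molecule_type": match["molecule_type"],
        "aligned": match["aln"] is not None,
    }


# Hypothetical file names, for illustration only.
unaligned = parse_gene_filename("4471.DNA.fasta")
aligned = parse_gene_filename("4471.DNA.aln.fasta")
```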

Files in fasta/by_recovery
--------------------------

<repository_name>.<sequence_id>.<species_name>.<sequence_set>.fasta

A “recovery” is a bioinformatic analysis of a set of sequence data from a single species, 
yielding a set of gene sequences. All sequence sets used for recoveries are accessioned in
a public repository.

Repositories in use in this release:

INSDC: The ENA/GenBank/DDBJ International Nucleotide Sequence Database Collaboration 
       (INSDC). 
oneKP: The data repository of the One Thousand Plant Transcriptomes Initiative, available
       at 
       https://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/
       oneKP_capstone_2019

The sequence_id is the identifier used for those sequences within the named repository.

Sequence sets in use in this release:

a353: the Angiosperms353 gene set 
(see Johnson et al. 2019, https://doi.org/10.1093/sysbio/syy086)

The files in this directory always contain DNA sequence.  It is not anticipated that 
protein sequence files will be made available on a per recovery basis.
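
The per-recovery naming pattern can be taken apart in the same way. The sketch below is 
hedged: it assumes that none of the four fields contains a dot, which should be checked 
against the actual file names, and the function name and example values are hypothetical.

```python
def parse_recovery_filename(name):
    """Split a fasta/by_recovery file name into its four fields.

    Assumption: the repository name, sequence ID, species name and sequence
    set contain no dots, so the name splits into exactly five parts.
    """
    parts = name.split(".")
    if len(parts) != 5 or parts[-1] != "fasta":
        raise ValueError(f"unrecognised recovery file name: {name!r}")
    repository, sequence_id, species, sequence_set, _extension = parts
    return {
        "repository": repository,
        "sequence_id": sequence_id,
        "species": species,
        "sequence_set": sequence_set,
    }


# Hypothetical file name, for illustration only.
recovery = parse_recovery_filename("INSDC.ERR0000000.Oryza_sativa.a353.fasta")
```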

Files in tree/gene
------------------

<gene_id>.tree

A gene tree is built for each gene from the corresponding alignment in fasta/alignments 
and is provided in Newick format.

Nodes are labelled as follows:

<Order>_<Family>_<Genus>_<Species>_<Sequence_ID>

Where the sequence ID is the identifier of the sequence derived from the sample as stored
in a sequence repository.  Further details are provided in the sequence manifest.
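
A node label of this form can be split into its fields as sketched below. The sketch 
assumes that order, family, genus and species contain no underscores, so that anything 
after the fourth underscore belongs to the sequence ID; the class name, function name 
and example label are hypothetical.

```python
from typing import NamedTuple


class NodeLabel(NamedTuple):
    order: str
    family: str
    genus: str
    species: str
    sequence_id: str


def parse_node_label(label):
    """Split <Order>_<Family>_<Genus>_<Species>_<Sequence_ID> into fields.

    Assumption: the first four fields contain no underscores; the sequence
    ID keeps any underscores it contains because of the split limit.
    """
    parts = label.split("_", 4)
    if len(parts) != 5:
        raise ValueError(f"unrecognised node label: {label!r}")
    return NodeLabel(*parts)


# Hypothetical label, for illustration only.
node = parse_node_label("Poales_Poaceae_Oryza_sativa_ERR0000000")
```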

Files in tree/species
---------------------

treeoflife.<release_id>.tree
treeoflife.current.tree

A file containing the Kew "tree of life" for all species included in this release in 
Newick format.  The file name contains the release ID; a symlink to the current tree is 
provided with every release for convenient download.

treeoflife.all_support_values.<release_id>.tree
treeoflife.all_support_values.current.tree

A file containing the Kew "tree of life" for all species included in this release in 
Newick format, with the inclusion of all support value data for each node defined as 
follows:

q1:  quartet support for the main topology
q2:  quartet support for the first alternative topology
q3:  quartet support for the second alternative topology
f1:  number of quartet trees in all the gene trees that support the main topology
f2:  number of quartet trees in all the gene trees that support the first alternative 
     topology
f3:  number of quartet trees in all the gene trees that support the second alternative 
     topology
pp1: local posterior probability for the main topology
pp2: local posterior probability for the first alternative topology
pp3: local posterior probability for the second alternative topology
QC:  number of quartets defined around each branch
EN:  effective number of genes for the branch
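
This document does not describe how the support values are written inside the Newick 
file. The sketch below therefore rests on an assumption: that each node carries a 
semicolon-separated key=value annotation such as [q1=0.8;q2=0.1;q3=0.1]. This should be 
checked against the actual file before use. The function name and example values are 
hypothetical.

```python
def parse_support_annotation(text):
    """Parse an annotation such as '[q1=0.8;q2=0.1;q3=0.1]' into floats.

    Assumption: values are written as key=value pairs separated by
    semicolons, optionally wrapped in square brackets and single quotes.
    """
    body = text.strip().strip("'").strip("[]")
    values = {}
    for item in body.split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, separator, raw = item.partition("=")
        if not separator:
            raise ValueError(f"missing '=' in {item!r}")
        values[key.strip()] = float(raw)
    return values


# Hypothetical annotation, for illustration only.
support = parse_support_annotation("[q1=0.8;q2=0.1;q3=0.1;pp1=1.0;QC=120;EN=30.5]")
```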

Nodes are labelled (in both trees) as follows:

<Order>_<Family>_<Genus>_<Species>_<Sequence_ID>

Where the sequence ID is the identifier of the sequence derived from the sample as stored
in a sequence repository.  Further details are provided in the sequence manifest.

Files in nex
------------

treeoflife.<release_id>.nex
treeoflife.current.nex

A file containing the Kew "tree of life" for all species included in this release in 
NEXUS format.  The file name contains the release ID; a symlinked file to the current 
tree is provided with every release for convenient download.

Files in svg
------------

treeoflife.<release_id>.svg
treeoflife.current.svg

A file containing the Kew "tree of life" for all species included in this release in 
Scalable Vector Graphics (SVG) format.  The file name contains the release ID; a 
symlinked file to the current tree is provided with every release for convenient download.

3. FASTA headers
================

Sequences in FASTA files have single-line headers as follows (wrapped here for display):

><gene_id> Gene_Name:<gene_name> Species:<species_name> Repository:<repository_name> 
Sequence_ID:<sequence_id>

The Gene ID identifies the pan-species gene concept, and is taken from the Angiosperms353 
data set (Johnson et al., https://doi.org/10.1093/sysbio/syy086).

The gene name is an exemplar gene name for the gene that has been recovered (i.e., in use 
for this gene in at least one of the species from which the gene has been recovered).  It 
is not necessarily the name by which the gene is known in the recovered species.  All 
instances of this gene are assigned the same name in a single release.  The gene name is 
not guaranteed to be stable between releases. To identify the same gene in successive 
releases, use the Gene ID. If no suitably named exemplar gene has been found the gene 
name is given as ‘NA’.

The species name comprises genus and species names in accordance with scientific 
convention and uses underscores in place of spaces.

The sequence repository and sequence identifier are as defined for the names of the files
in the fasta/by_recovery directory (see above).
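
A header of this form can be parsed with a regular expression, as in the following 
sketch. It assumes that the fields appear in the order shown and are separated by single 
spaces, and it allows the gene name (but no other field) to contain spaces. The function 
name and example header are hypothetical.

```python
import re

# The gene name is matched non-greedily so that it may contain spaces;
# the other fields are assumed to be free of whitespace.
_HEADER = re.compile(
    r"^>(?P<gene_id>\S+) Gene_Name:(?P<gene_name>.*?) Species:(?P<species>\S+) "
    r"Repository:(?P<repository>\S+) Sequence_ID:(?P<sequence_id>\S+)\s*$"
)


def parse_fasta_header(line):
    """Return the five header fields of a FASTA header line as a dict."""
    match = _HEADER.match(line)
    if match is None:
        raise ValueError(f"unrecognised FASTA header: {line!r}")
    return match.groupdict()


# Hypothetical header, for illustration only.
header = parse_fasta_header(
    ">4471 Gene_Name:NA Species:Oryza_sativa Repository:INSDC Sequence_ID:ERR0000000\n"
)
```

Because the gene name is not guaranteed to be stable between releases, downstream code 
would normally key on the gene_id field rather than on gene_name.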

4. Manifests
============

Deleted Sequences
-----------------

This (deleted_sequences.txt) is a tab-delimited file, with columns as follows:

1. Repository name
2. Sequence identifier
3. Sequence type. One of genome, transcript, read.
4. Scientific species name
5. Release ID where first included
6. Release ID from which sequence was deleted
7. Reason for deletion

The values of “Reason for deletion” currently are: 

Duplicated_sequencing_run: a different sequencing run has been chosen to represent this
                           sample.

Failed_family_identification: the taxonomic identity of the sequence generated was 
                              inconsistent with the sequence obtained at known barcoding 
                              loci, or the sample placed in the wrong family in the 
                              preliminary tree, in accordance with the procedure described 
                              in Baker et al. (https://doi.org/10.1093/sysbio/syab035).
                              
Permanently_excluded: the specimen was excluded pre-analysis due to expert review. Either 
                      the expert has seen the specimen and considers it does not match the
                      sample  identification, or, in previous analyses, the sample did not
                      lie in a credible place in the phylogeny.
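
As a hedged sketch of how this file might be read, the code below uses Python's csv 
module to attach names to the seven columns and to count deletions by reason. It assumes 
that the file has no header row; the column identifiers, function names and sample rows 
are illustrative and do not appear in the file.

```python
import csv
import io
from collections import Counter

# Identifiers chosen for this sketch; they follow the numbered columns above.
DELETED_COLUMNS = [
    "repository", "sequence_id", "sequence_type", "species",
    "first_release", "deleted_release", "reason",
]


def read_deleted_sequences(handle):
    """Yield one dict per row of a deleted_sequences.txt-style file.

    Assumption: there is no header row; skip the first row if there is one.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != len(DELETED_COLUMNS):
            raise ValueError(f"expected {len(DELETED_COLUMNS)} columns: {row!r}")
        yield dict(zip(DELETED_COLUMNS, row))


def count_by_reason(handle):
    """Count deleted sequences per value of 'Reason for deletion'."""
    return Counter(record["reason"] for record in read_deleted_sequences(handle))


# Hypothetical rows, for illustration only.
sample = io.StringIO(
    "INSDC\tERR0000001\tread\tGenus one\t1.0\t2.0\tDuplicated_sequencing_run\n"
    "INSDC\tERR0000002\tread\tGenus two\t1.0\t2.0\tPermanently_excluded\n"
    "INSDC\tERR0000003\tread\tGenus three\t1.0\t3.0\tPermanently_excluded\n"
)
counts = count_by_reason(sample)
```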

Sequence manifest
-----------------

This (sequence_manifest.txt) is a tab-delimited file, with columns as follows:

1. Repository name
2. Sequence identifier
3. Sequence type. One of annotated_genome, unannotated_genome, transcript, read
4. Scientific species name
5. Project name.  One of PAFTOL, oneKP, GAP.
   A '-' is used when the sequence has not been generated by a known phylogenetic project.

The values of 'Repository name' currently in use are:

INSDC: The ENA/GenBank/DDBJ International Nucleotide Sequence Database Collaboration 
       (INSDC) 

oneKP: The data repository of the One Thousand Plant Transcriptomes Initiative, available
       at 
       https://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/
       oneKP_capstone_2019
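
A single line of the sequence manifest could be validated as sketched below. The sketch 
checks the sequence type against the four values listed above and converts a project 
name of '-' to None. The function name, field identifiers and sample line are 
hypothetical.

```python
SEQUENCE_TYPES = {"annotated_genome", "unannotated_genome", "transcript", "read"}


def parse_sequence_manifest_line(line):
    """Parse one tab-delimited line of a sequence_manifest.txt-style file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 columns, got {len(fields)}: {line!r}")
    repository, sequence_id, sequence_type, species, project = fields
    if sequence_type not in SEQUENCE_TYPES:
        raise ValueError(f"unknown sequence type: {sequence_type!r}")
    return {
        "repository": repository,
        "sequence_id": sequence_id,
        "sequence_type": sequence_type,
        "species": species,
        # '-' marks a sequence not generated by a known phylogenetic project.
        "project": None if project == "-" else project,
    }


# Hypothetical lines, for illustration only.
with_project = parse_sequence_manifest_line("INSDC\tERR0000000\tread\tGenus one\tPAFTOL\n")
without_project = parse_sequence_manifest_line("INSDC\tXX000000\ttranscript\tGenus two\t-\n")
```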

Revised specimen nomenclature
----------------------------- 

This (revised_specimen_nomenclature.txt) is a tab-delimited file, with columns as
follows:

1. Repository name
2. Sequence identifier
3. Old species name
4. New species name
5. Release where new name first used
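
One possible use of this file is to look up the most recent name for a sequence. The 
sketch below builds a lookup keyed on repository name and sequence identifier. It 
assumes that rows are ordered from oldest to newest release, so that a later row for the 
same sequence overrides an earlier one; this ordering is not stated in this document. 
The function names and sample rows are hypothetical.

```python
def parse_rows(text):
    """Split tab-delimited text into a list of field lists, skipping blank lines."""
    return [line.split("\t") for line in text.splitlines() if line.strip()]


def latest_names(rows):
    """Map (repository, sequence_id) to the most recently assigned species name.

    Assumption: rows run from oldest to newest release.
    """
    names = {}
    for repository, sequence_id, _old_name, new_name, _release in rows:
        names[(repository, sequence_id)] = new_name
    return names


# Hypothetical rows, for illustration only.
sample_rows = parse_rows(
    "INSDC\tERR0000001\tGenus alpha\tGenus beta\t2.0\n"
    "INSDC\tERR0000002\tGenus gamma\tGenus delta\t2.0\n"
    "INSDC\tERR0000001\tGenus beta\tGenus epsilon\t3.0\n"
)
renamed = latest_names(sample_rows)
```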

Specimen manifest
-----------------

This (specimen_manifest.txt) is a tab-delimited file, with columns as follows:

1. Scientific species name
2. Collection ID (of the specimen used); from Index Herbariorum
3. Specimen ID or barcode
4. Voucher information
5. Specimen URL (to an online catalogue entry for that specimen, where available)

The values of 'Collection ID' currently in use are:

AD:    State Herbarium of South Australia (Australia, South Australia, Adelaide)
APSC:  Austin Peay State University (U.S.A., Tennessee, Clarksville)
BA:    Museo Argentino de Ciencias Naturales "Bernardino Rivadavia" (Argentina, Buenos 
       Aires)
BC:    Institut Botanic de Barcelona (Spain, Barcelona)
BCN:   University of Barcelona (Spain, Barcelona)
BCRU:  Universidad Nacional del Comahue (Argentina, Río Negro, San Carlos de Bariloche)
BG:    University of Bergen (Norway, Bergen)
BH:    Cornell University (U.S.A., New York, Ithaca)
BHCB:  Universidade Federal de Minas Gerais (Brazil, Minas Gerais, Belo Horizonte)
BHO:   Ohio University (U.S.A., Ohio, Athens)
BISH:  Bishop Museum (U.S.A., Hawaii, Honolulu)
BJFC:  Beijing Forestry University (People's Republic of China, Beijing)
BKF:   Department of National Parks, Wildlife and Plant Conservation (Thailand, Bangkok,
       Chatuchak)
BM:    The Natural History Museum (U.K., England, London)
BNRH:  Buffelskloof Nature Reserve (South Africa, Mpumalanga Province, Lydenburg)
BONN:  University of Bonn (Germany, Bonn)
BR:    Meise Botanic Garden (Belgium, Meise)
BRI:   Queensland Herbarium (Australia, Queensland, Brisbane)
BRIT:  Botanical Research Institute of Texas (U.S.A., Texas, Fort Worth)
BRLU:  Universite Libre de Bruxelles (Belgium, Bruxelles)
BRUN:  Brunei Forestry Centre (Brunei Darussalam, Belait)
BZ:    Herbarium Bogoriense (Indonesia, Java, Bogor)
C:     University of Copenhagen (Denmark, Copenhagen)
CAN:   Canadian Museum of Nature (Canada, Quebec, Gatineau)
CANB:  Australian National Herbarium (Australia, Australian Capital Territory, Canberra)
CAS:   California Academy of Sciences (U.S.A., California, San Francisco)
CBG:   Australian National Herbarium (Australia, Australian Capital Territory, Canberra)
CNS:   Australian Tropical Herbarium (Australia, Queensland, Smithfield)
COL:   Universidad Nacional de Colombia (Colombia, D.C., Bogota)
CONC:  Universidad de Concepción (Chile, Concepcion)
CORD:  Herbario CORD (Argentina, Córdoba, Cordoba)
CS:    Colorado State University (U.S.A., Colorado, Fort Collins)
CUVC:  Universidad del Valle (Colombia, Valle del Cauca, Cali)
DNA:   Department of Environment Parks and Water Security (Australia, Northern Territory,
       Palmerston)
E:     Royal Botanic Garden Edinburgh (U.K., Scotland, Edinburgh)
EA:    National Museums of Kenya (Kenya, Nairobi)
F:     Field Museum of Natural History (U.S.A., Illinois, Chicago)
FLAS:  Florida Museum of Natural History (U.S.A., Florida, Gainesville)
FMB:   Instituto de Investigación de Recursos Biológicos Alexander von Humboldt 
       (Colombia, Villa de Leyva)
FTG:   Fairchild Tropical Botanic Garden (U.S.A., Florida, Miami)
G:     Conservatoire et Jardin botaniques de la Ville de Geneve (Switzerland, Geneve)
GB:    University of Gothenburg (Sweden, Goteborg)
GC:    University of Ghana (Ghana, Legon)
GENT:  Ghent University (Belgium, Ghent)
GH:    Harvard University (U.S.A., Massachusetts, Cambridge)
GOET:  Universität Göttingen (Germany, Gottingen)
GUAY:  Universidad de Guayaquil (Ecuador, Guayas, Guayaquil)
GZU:   Karl-Franzens-Universität Graz (Austria, Graz)
HAW:   University of Hawaii (U.S.A., Hawaii, Honolulu)
HITBC: Xishuangbanna Tropical Botanical Garden, Academia Sinica (People's Republic of 
       China, Yunnan, Xishuangbanna)
HNG:   Universite Gamal Abdel Nasser de Conakry (UGANC) (Republic of Guinea, Conakry)
HO:    Tasmanian Museum and Art Gallery (Australia, Tasmania, Hobart)
HPUJ:  Pontificia Universidad Javeriana (Colombia, D.C., Santafé de Bogotá)
HRCB:  Universidade Estadual Paulista (Brazil, São Paulo, Rio Claro)
HTW:   Universidad Nacional de la Patagonia San Juan Bosco - Sede Trelew (Argentina, 
       Chubut, Trelew)
HUA:   Universidad de Antioquia (Colombia, Antioquia, Medellín)
HUAZ:  Universidad de la Amazonia (Colombia, Caquetá, Florencia)
HUEFS: Universidade Estadual de Feira de Santana (Brazil, Bahia, Feira de Santana)
HUFU:  Universidade Federal de Uberlandia (Brazil, Minas Gerais, Uberlândia)
IBSC:  South China Botanical Garden (People's Republic of China, Guangdong, Guangzhou)
IBUG:  Universidad de Guadalajara (Mexico, Jalisco, Zapopan)
ICN:   Universidade Federal do Rio Grande do Sul (Brazil, Rio Grande do Sul, Porto 
       Alegre)
IEB:   Instituto de Ecología, A.C. (Mexico, Michoacán, Pátzcuaro)
INB:   Instituto Nacional de Biodiversidad (Costa Rica, Santo Domingo)
INPA:  Instituto Nacional de Pesquisas da Amazônia (Brazil, Amazonas, Manaus)
JBB:   Jardín Botanico José Celestino Mutis (Colombia, Bogotá, D.C., Bogota, D.C.)
JBL:   Jardín Botanico Lankester, Universidad de Costa Rica (Costa Rica, Cartago)
JRAU:  University of Johannesburg (South Africa, Gauteng Province, Johannesburg)
K:     Royal Botanic Gardens, Kew (U.K., Kew)
KAS:   University of Kassel (Germany, Kassel)
KLU:   University of Malaya (Malaysia, Kuala Lumpur)
KRB:   Kebun Raya Bogor (Indonesia, Bogor)
KUN:   Kunming Institute of Botany, Chinese Academy of Sciences (People's Republic of 
       China, Yunnan, Kunming)
L:     Naturalis (Netherlands, Leiden)
LISC:  Instituto de Investigaçao Científica Tropical (Portugal, Lisboa)
LP:    Museo de La Plata (Argentina, Buenos Aires, La Plata)
LPB:   Herbario Nacional de Bolivia, Universidad Mayor de San Andres (Bolivia, La Paz)
LYJB:  Jardin botanique de la ville de Lyon (France, Lyon)
M:     Botanische Staatssammlung München (Germany, München)
MA:    Real Jardín Botanico (Spain, Madrid, Madrid)
MAU:   The Mauritius Herbarium (Mauritius, Reduit)
MBA:   Environmental Protection Agency (Australia, Queensland, Mareeba)
MBML:  Instituto Nacional da Mata Atlântica - INMA (Brazil, Espírito Santo, Santa Teresa)
MEDEL: Universidad Nacional de Colombia - Sede de Medellín (Colombia, Antioquia, 
       Medellín)
MEL:   Royal Botanic Gardens Victoria (Australia, Victoria, Melbourne)
MELU:  University of Melbourne (Australia, Victoria, Parkville)
MICH:  University of Michigan (U.S.A., Michigan, Ann Arbor)
MIN:   University of Minnesota (U.S.A., Minnesota, St. Paul)
MJG:   Johannes Gutenberg-Universitaet (Germany, Mainz)
MO:    Missouri Botanical Garden (U.S.A., Missouri, Saint Louis)
MT:    Universite de Montréal (Canada, Québec, Montreal)
MY:    Universidad Central de Venezuela (Venezuela, Aragua, Maracay)
N:     Nanjing University (People's Republic of China, Jiangsu, Nanjing)
NBG:   South African National Biodiversity Institute (South Africa, Western Cape 
       Province, Cape Town)
NCU:   University of North Carolina at Chapel Hill (U.S.A., North Carolina, Chapel Hill)
NCY:   Conservatoire et Jardins Botaniques de Nancy, Universite de Nancy I (France, 
       Nancy)
NE:    University of New England (Australia, New South Wales, Armidale)
NH:    South African National Biodiversity Institute (South Africa, KwaZulu-Natal 
       Province, Durban)
NHM:   University of Nottingham (U.K., England, Nottingham)
NHMR:  Natural History Museum Rijeka (Croatia, Rijeka)
NMNL:  Natuurmuseum Nijmegen e.o. (Netherlands, Nijmegen)
NOU:   Institut de Recherche pour le Development (IRD) (New Caledonia, Noumea)
NSW:   Royal Botanic Gardens & Domain Trust (Australia, New South Wales, Sydney)
NT:    Department of Environment, Parks and Water Security (Australia, Northern 
       Territory, Alice Springs)
NY:    The New York Botanical Garden (U.S.A., New York, Bronx)
ORT:   Instituto Canario de Investigaciones Agrarias (ICIA) (Spain, Canary Islands, 
       Puerto de la Cruz)
OS:    Ohio State University (U.S.A., Ohio, Columbus)
P:     Museum National d'Histoire Naturelle (France, Paris)
PERTH: Western Australian Herbarium (Australia, Western Australia, Perth)
PG:    Plant Gateway (U.K., Surrey, Kingston-upon-Thames)
PH:    Academy of Natural Sciences (U.S.A., Pennsylvania, Philadelphia)
PMA:   Universidad de Panamá (Panama, Panamá, Panamá)
PRE:   South African National Biodiversity Institute (South Africa, Gauteng Province, 
       Pretoria)
PTBG:  National Tropical Botanical Garden (U.S.A., Hawaii, Kalaheo)
QCA:   Pontificia Universidad Catolica del Ecuador (Ecuador, Quito)
QRS:   CSIRO (Australia, Queensland, Atherton)
RB:    Jardim Botanico do Rio de Janeiro (Brazil, Rio de Janeiro, Rio de Janeiro)
REU:   Universite de la Reunion (Reunion, Sainte-Clotilde)
SALA:  Universidad de Salamanca (Spain, Salamanca)
SAR:   Department of Forestry (Malaysia, Sarawak, Kuching)
SGO:   Museo Nacional de Historia Natural (Chile, Santiago)
SI:    Instituto de Botanica Darwinion (Argentina, Buenos Aires, San Isidro)
SING:  Singapore Botanic Gardens (Singapore, Singapore, Singapore)
SP:    Instituto de Botânica (Brazil, São Paulo, São Paulo)
SPF:   Universidade de Sao Paulo (Brazil, São Paulo, São Paulo)
SPFR:  Universidade de Sao Paulo (Brazil, São Paulo, Ribeirao Preto)
SUVA:  University of the South Pacific (Fiji, Suva)
TAN:   Parc Botanique et Zoologique de Tsimbazaza (PBZT) (Madagascar, Antananarivo)
TCD:   Trinity College (Ireland, Dublin)
TEX:   University of Texas at Austin (U.S.A., Texas, Austin)
TNS:   National Museum of Nature and Science (Japan, Tsukuba)
TUM:   Technische Universität München (Germany, Freising)
U:     Naturalis (Netherlands, Leiden)
UAPC:  University of Alberta (Canada, Alberta, Edmonton)
UB:    Universidade de Brasília (Brazil, Distrito Federal, Brasília)
UEC:   Universidade Estadual de Campinas (Brazil, Campinas)
UIS:   Universidad Industrial de Santander (Colombia, Santander, Bucaramanga)
UPCB:  Universidade Federal do Paraná (Brazil, Paraná, Curitiba)
UPR:   Botanical Garden of the University of Puerto Rico (Puerto Rico, Puerto Rico, Río 
       Piedras)
UPS:   Museum of Evolution (Sweden, Uppsala)
UPTC:  Universidad Pedagógica y Tecnológica de Colombia (Colombia, Boyacá, Tunja)
US:    Smithsonian Institution (U.S.A., District of Columbia, Washington)
USM:   Universidad Nacional Mayor de San Marcos (Peru, Lima)
WTU:   University of Washington (U.S.A., Washington, Seattle)
YA:    National Herbarium of Cameroon (Cameroon, Yaounde)
ZSS:   Sukkulenten-Sammlung Zürich (Switzerland, Zürich)      
      
Where no information is available, a column contains the text '-'. 

Gene manifest
-------------

This (gene_manifest.txt) is a tab-delimited file, with columns as follows:

1. Gene ID
2. Exemplar gene name
3. Species from which the exemplar gene name has been taken
4. Database name (of the database from which the exemplar gene name was obtained)
5. Record ID (of the database record from which the exemplar gene name was obtained)
6. URL (to the online database record from which the exemplar gene name was obtained)
7. In tree? (values 'Y' or 'N') - indicates whether this gene was used to build the 
   species tree or not.  

The databases from which exemplar gene names are taken are currently:

UniProtKB: The UniProt Knowledgebase (http://www.uniprot.org)

If no suitably named exemplar gene has been found, columns 2 – 6 contain the text 
‘-’.
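
As an illustration of column 7, the following sketch collects the Gene IDs of the genes 
that were used to build the species tree. It assumes one gene per line with exactly 
seven tab-delimited columns and no header row; the function name and sample rows are 
hypothetical.

```python
def genes_in_tree(lines):
    """Return the Gene IDs whose 'In tree?' column (column 7) is 'Y'."""
    gene_ids = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
        if fields[6] == "Y":
            gene_ids.append(fields[0])
    return gene_ids


# Hypothetical rows, for illustration only.
sample_lines = [
    "g1\tgeneA\tGenus one\tUniProtKB\tP00000\thttp://example.org/P00000\tY\n",
    "g2\t-\t-\t-\t-\t-\tN\n",
    "g3\t-\t-\t-\t-\t-\tY\n",
]
used = genes_in_tree(sample_lines)
```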