README
======

The Kew Tree of Life Explorer allows users to explore evolutionary trees of life
and to access the genomic data that underpin them. It is an output of the Plant
and Fungal Trees of Life Project (PAFTOL) at the Royal Botanic Gardens, Kew
(https://www.kew.org/science/our-science/projects/plant-and-fungal-trees-of-life),
which aims to discover and disseminate the evolutionary history of all plant and
fungal genera through phylogenetic approaches.

Tree of Life data are periodically released via the Kew secure file transfer
protocol (SFTP) site (sftp.kew.org/pub/treeoflife) and are additionally made
available for interactive web-based exploration at http://treeoflife.kew.org.

This document contains the following sections:

1. The Kew Tree of Life SFTP site
   Overview of directory structure and files contained within each directory
2. File naming conventions
3. FASTA headers
4. Manifests
   Description of file formats for sequence_manifest.txt,
   deleted_sequences.txt, specimen_manifest.txt,
   revised_specimen_nomenclature.txt, gene_manifest.txt


1. The Kew Tree of Life SFTP site
=================================

|-- README.txt  This document
|
|-- current_release  A link to the current release of Kew Tree of Life Explorer
|
|-- releases  A directory containing all previous releases of Kew Tree of Life
    |         data
    |
    |-- One directory for each Kew Tree of Life data release
        |
        |-- kew_tree_of_life_release_notes_.txt  A document describing the
        |       contents of the release
        |
        |-- kew_tree_of_life_release_notes.txt  A symlink to the above file
        |
        |-- sequence_manifest.txt  A document listing the accession numbers (in
        |       public repositories) of all nucleotide sequence data used in
        |       the release
        |
        |-- deleted_sequences.txt  A document listing the accession numbers (in
        |       public repositories) of all nucleotide sequence data used in
        |       previous releases of the Kew Tree of Life that have not been
        |       used in this one
        |
        |-- specimen_manifest.txt  A document listing the scientific name of
        |       all species included in this release, with additional
        |       information about the specimens which have been sampled
        |
        |-- revised_specimen_nomenclature.txt  A document identifying changes
        |       in specimen nomenclature between successive releases
        |
        |-- gene_manifest.txt  A document listing the genes included in this
        |       release
        |
        |-- fasta  A directory containing gene sequences in FASTA format.
        |   |      Sequences are generated from recovery processes, for a
        |   |      number of specified genes, and according to a specified
        |   |      method
        |   |
        |   |-- alignments  A directory containing alignment data for each
        |   |       gene, in aligned FASTA format.
        |   |
        |   |-- by_gene  A directory of files, each containing all assembled
        |   |       sequences for a given gene
        |   |
        |   |-- by_recovery  A directory of files, each containing all
        |           assembled sequences for a given recovery
        |
        |-- tree  A directory containing tree files in Newick format for genes
        |   |     and species
        |   |
        |   |-- gene  A directory containing tree files in Newick format for
        |   |       each gene used to build the species tree
        |   |
        |   |-- species  A directory containing the species tree file for this
        |           release in Newick format
        |
        |-- nex
        |   |
        |   |-- species  A directory containing the species tree file for this
        |           release in NEXUS format
        |
        |-- svg
            |
            |-- species  A directory containing the species tree file for this
                    release in SVG format


2. File naming conventions
==========================

Files in fasta/alignments
-------------------------

gene_id..aln.fasta

An alignment is built for each gene from the corresponding sequences in
fasta/by_gene. Sequences with poor coverage of the alignment have been removed
as described in Baker et al. (https://doi.org/10.1093/sysbio/syab035).

Files in fasta/by_gene
----------------------

..fasta

The Gene ID identifies the pan-species gene concept, and is taken from the
Angiosperms353 data set (Johnson et al., https://doi.org/10.1093/sysbio/syy086).

Molecule types used in this release:

DNA

Protein files may be provided in future releases.

Files in fasta/by_recovery
--------------------------

....fasta

A “recovery” is a bioinformatic analysis of a set of sequence data from a single
species, yielding a set of gene sequences.

All sequence sets used for recoveries are accessioned in a public repository.
Repositories in use in this release:

INSDC: The ENA/GenBank/DDBJ International Nucleotide Sequence Database
       Collaboration (INSDC).
oneKP: The data repository of the One Thousand Plant Transcriptomes Initiative,
       available at
       https://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/oneKP_capstone_2019

The sequence_id is the identifier used for those sequences within the named
repository.

Sequence sets in use in this release:

a353: the Angiosperms353 gene set (see Johnson et al. 2019,
      https://doi.org/10.1093/sysbio/syy086)

The files in this directory always contain DNA sequence. It is not anticipated
that protein sequence files will be made available on a per-recovery basis.

Files in tree/gene
------------------

gene_id.tree

A gene tree, in Newick format, built for each gene from the corresponding
alignment in fasta/alignments.

Nodes are labelled as follows:

____

Where the sequence ID is the identifier of the sequence derived from the sample
as stored in a sequence repository. Further details are provided in the sequence
manifest.

Files in tree/species
---------------------

treeoflife..tree
treeoflife.current.tree

A file containing the Kew "tree of life" for all species included in this
release in Newick format. The file name contains the release ID; a symlink to
the current tree is provided with every release for convenient download.

treeoflife.all_support_values..tree
treeoflife.all_support_values.current.tree

A file containing the Kew "tree of life" for all species included in this
release in Newick format, with the inclusion of all support value data for each
node defined as follows:

q1:  quartet support for the main topology
q2:  quartet support for the first alternative topology
q3:  quartet support for the second alternative topology
f1:  number of quartet trees in all the gene trees that support the main
     topology
f2:  number of quartet trees in all the gene trees that support the first
     alternative topology
f3:  number of quartet trees in all the gene trees that support the second
     alternative topology
pp1: local posterior probability for the main topology
pp2: local posterior probability for the first alternative topology
pp3: local posterior probability for the second alternative topology
QC:  number of quartets defined around each branch
EN:  effective number of genes for the branch

Nodes are labelled (in both trees) as follows:

____

Where the sequence ID is the identifier of the sequence derived from the sample
as stored in a sequence repository. Further details are provided in the sequence
manifest.

Files in nex
------------

treeoflife..nex
treeoflife.current.nex

A file containing the Kew "tree of life" for all species included in this
release in NEXUS format. The file name contains the release ID; a symlink to the
current tree is provided with every release for convenient download.

Files in svg
------------

treeoflife..svg
treeoflife.current.svg

A file containing the Kew "tree of life" for all species included in this
release in Scalable Vector Graphics (SVG) format. The file name contains the
release ID; a symlink to the current tree is provided with every release for
convenient download.


3. FASTA headers
================

Sequences in FASTA files have headers as follows:

> Gene_Name: Species: Repository: Sequence_ID:

The Gene ID identifies the pan-species gene concept, and is taken from the
Angiosperms353 data set (Johnson et al., https://doi.org/10.1093/sysbio/syy086).

The gene name is an exemplar gene name for the gene that has been recovered
(i.e., a name in use for this gene in at least one of the species from which the
gene has been recovered). It is not necessarily the name by which the gene is
known in the recovered species. All instances of this gene are assigned the same
name in a single release. The gene name is not guaranteed to be stable between
releases. To identify the same gene in successive releases, use the Gene ID. If
no suitably named exemplar gene has been found, the gene name is given as ‘NA’.

The species name comprises genus and species names in accordance with scientific
convention and uses underscores in place of spaces.

The sequence repository and sequence identifier are as defined for the names of
the files in the fasta/by_recovery directory (see above).


4. Manifests
============

Deleted sequences
-----------------

This (deleted_sequences.txt) is a tab-delimited file, with columns as follows:

1. Repository name
2. Sequence identifier
3. Sequence type. One of genome, transcript, read.
4. Scientific species name
5. Release ID where first included
6. Release ID from which sequence was deleted
7. Reason for deletion

The values of “Reason for deletion” currently are:

Duplicated_sequencing_run: a different sequencing run has been chosen to
    represent this sample.

Failed_family_identification: the taxonomic identity of the sequence generated
    was inconsistent with the sequence obtained at known barcoding loci, or the
    sample placed in the wrong family in the preliminary tree, in accordance
    with the procedure described in Baker et al.
    (https://doi.org/10.1093/sysbio/syab035).

Permanently_excluded: the specimen was excluded before analysis following
    expert review. Either the expert has seen the specimen and considers that
    it does not match the sample identification, or, in previous analyses, the
    sample did not lie in a credible place in the phylogeny.

Sequence manifest
-----------------

This (sequence_manifest.txt) is a tab-delimited file, with columns as follows:

1. Repository name
2. Sequence identifier
3. Sequence type. One of annotated_genome, unannotated_genome, transcript, read
4. Scientific species name
5. Project name. One of PAFTOL, oneKP, GAP. A '-' is used when the sequence has
   not been generated by a known phylogenetic project.

The values of 'Repository name' currently in use are:

INSDC: The ENA/GenBank/DDBJ International Nucleotide Sequence Database
       Collaboration (INSDC)
oneKP: The data repository of the One Thousand Plant Transcriptomes Initiative,
       available at
       https://datacommons.cyverse.org/browse/iplant/home/shared/commons_repo/curated/oneKP_capstone_2019

Revised specimen nomenclature
-----------------------------

This (revised_specimen_nomenclature.txt) is a tab-delimited file, with columns
as follows:

1. Repository name
2. Sequence identifier
3. Old species name
4. New species name
5. Release where new name first used

Specimen manifest
-----------------

This (specimen_manifest.txt) is a tab-delimited file, with columns as follows:

1. Scientific species name
2. Collection ID (of the specimen used); from Index Herbariorum
3. Specimen ID or barcode
4. Voucher information
5. Specimen URL (to an online catalogue entry for that specimen, where
   available)

The values of 'Collection ID' currently in use are:

AD: State Herbarium of South Australia (Australia, South Australia, Adelaide)
APSC: Austin Peay State University (U.S.A., Tennessee, Clarksville)
BA: Museo Argentino de Ciencias Naturales "Bernardino Rivadavia" (Argentina, Buenos Aires)
BC: Institut Botanic de Barcelona (Spain, Barcelona)
BCN: University of Barcelona (Spain, Barcelona)
BCRU: Universidad Nacional del Comahue (Argentina, Río Negro, San Carlos de Bariloche)
BG: University of Bergen (Norway, Bergen)
BH: Cornell University (U.S.A., New York, Ithaca)
BHCB: Universidade Federal de Minas Gerais (Brazil, Minas Gerais, Belo Horizonte)
BHO: Ohio University (U.S.A., Ohio, Athens)
BISH: Bishop Museum (U.S.A., Hawaii, Honolulu)
BJFC: Beijing Forestry University (People's Republic of China, Beijing)
BKF: Department of National Parks, Wildlife and Plant Conservation (Thailand, Bangkok, Chatuchak)
BM: The Natural History Museum (U.K., England, London)
BNRH: Buffelskloof Nature Reserve (South Africa, Mpumalanga Province, Lydenburg)
BONN: University of Bonn (Germany, Bonn)
BR: Meise Botanic Garden (Belgium, Meise)
BRI: Queensland Herbarium (Australia, Queensland, Brisbane)
BRIT: Botanical Research Institute of Texas (U.S.A., Texas, Fort Worth)
BRLU: Universite Libre de Bruxelles (Belgium, Bruxelles)
BRUN: Brunei Forestry Centre (Brunei Darussalam, Belait)
BZ: Herbarium Bogoriense (Indonesia, Java, Bogor)
C: University of Copenhagen (Denmark, Copenhagen)
CAN: Canadian Museum of Nature (Canada, Quebec, Gatineau)
CANB: Australian National Herbarium (Australia, Australian Capital Territory, Canberra)
CAS: California Academy of Sciences (U.S.A., California, San Francisco)
CBG: Australian National Herbarium (Australia, Australian Capital Territory, Canberra)
CNS: Australian Tropical Herbarium (Australia, Queensland, Smithfield)
COL: Universidad Nacional de Colombia (Colombia, D.C., Bogota)
CONC: Universidad de Concepción (Chile, Concepcion)
CORD: Herbario CORD (Argentina, Córdoba, Cordoba)
CS: Colorado State University (U.S.A., Colorado, Fort Collins)
CUVC: Universidad del Valle (Colombia, Valle del Cauca, Cali)
DNA: Department of Environment Parks and Water Security (Australia, Northern Territory, Palmerston)
E: Royal Botanic Garden Edinburgh (U.K., Scotland, Edinburgh)
EA: National Museums of Kenya (Kenya, Nairobi)
F: Field Museum of Natural History (U.S.A., Illinois, Chicago)
FLAS: Florida Museum of Natural History (U.S.A., Florida, Gainesville)
FMB: Instituto de Investigación de Recursos Biológicos Alexander von Humboldt (Colombia, Villa de Leyva)
FTG: Fairchild Tropical Botanic Garden (U.S.A., Florida, Miami)
G: Conservatoire et Jardin botaniques de la Ville de Geneve (Switzerland, Geneve)
GB: University of Gothenburg (Sweden, Goteborg)
GC: University of Ghana (Ghana, Legon)
GENT: Ghent University (Belgium, Ghent)
GH: Harvard University (U.S.A., Massachusetts, Cambridge)
GOET: Universität Göttingen (Germany, Gottingen)
GUAY: Universidad de Guayaquil (Ecuador, Guayas, Guayaquil)
GZU: Karl-Franzens-Universität Graz (Austria, Graz)
HAW: University of Hawaii (U.S.A., Hawaii, Honolulu)
HITBC: Xishuangbanna Tropical Botanical Garden, Academia Sinica (People's Republic of China, Yunnan, Xishuangbanna)
HNG: Universite Gamal Abdel Nasser de Conakry (UGANC) (Republic of Guinea, Conakry)
HO: Tasmanian Museum and Art Gallery (Australia, Tasmania, Hobart)
HPUJ: Pontificia Universidad Javeriana (Colombia, D.C., Santafé de Bogotá)
HRCB: Universidade Estadual Paulista (Brazil, São Paulo, Rio Claro)
HTW: Universidad Nacional de la Patagonia San Juan Bosco - Sede Trelew (Argentina, Chubut, Trelew)
HUA: Universidad de Antioquia (Colombia, Antioquia, Medellín)
HUAZ: Universidad de la Amazonia (Colombia, Caquetá, Florencia)
HUEFS: Universidade Estadual de Feira de Santana (Brazil, Bahia, Feira de Santana)
HUFU: Universidade Federal de Uberlandia (Brazil, Minas Gerais, Uberlândia)
IBSC: South China Botanical Garden (People's Republic of China, Guangdong, Guangzhou)
IBUG: Universidad de Guadalajara (Mexico, Jalisco, Zapopan)
ICN: Universidade Federal do Rio Grande do Sul (Brazil, Rio Grande do Sul, Porto Alegre)
IEB: Instituto de Ecología, A.C. (Mexico, Michoacán, Pátzcuaro)
INB: Instituto Nacional de Biodiversidad (Costa Rica, Santo Domingo)
INPA: Instituto Nacional de Pesquisas da Amazônia (Brazil, Amazonas, Manaus)
JBB: Jardín Botanico José Celestino Mutis (Colombia, Bogotá, D.C., Bogota, D.C.)
JBL: Jardín Botanico Lankester, Universidad de Costa Rica (Costa Rica, Cartago)
JRAU: University of Johannesburg (South Africa, Gauteng Province, Johannesburg)
K: Royal Botanic Gardens, Kew (U.K., Kew)
KAS: University of Kassel (Germany, Kassel)
KLU: University of Malaya (Malaysia, Kuala Lumpur)
KRB: Kebun Raya Bogor (Indonesia, Bogor)
KUN: Kunming Institute of Botany, Chinese Academy of Sciences (People's Republic of China, Yunnan, Kunming)
L: Naturalis (Netherlands, Leiden)
LISC: Instituto de Investigaçao Científica Tropical (Portugal, Lisboa)
LP: Museo de La Plata (Argentina, Buenos Aires, La Plata)
LPB: Herbario Nacional de Bolivia, Universidad Mayor de San Andres (Bolivia, La Paz)
LYJB: Jardin botanique de la ville de Lyon (France, Lyon)
M: Botanische Staatssammlung München (Germany, München)
MA: Real Jardín Botanico (Spain, Madrid, Madrid)
MAU: The Mauritius Herbarium (Mauritius, Reduit)
MBA: Environmental Protection Agency (Australia, Queensland, Mareeba)
MBML: Instituto Nacional da Mata Atlântica - INMA (Brazil, Espírito Santo, Santa Teresa)
MEDEL: Universidad Nacional de Colombia - Sede de Medellín (Colombia, Antioquia, Medellín)
MEL: Royal Botanic Gardens Victoria (Australia, Victoria, Melbourne)
MELU: University of Melbourne (Australia, Victoria, Parkville)
MICH: University of Michigan (U.S.A., Michigan, Ann Arbor)
MIN: University of Minnesota (U.S.A., Minnesota, St. Paul)
MJG: Johannes Gutenberg-Universitaet (Germany, Mainz)
MO: Missouri Botanical Garden (U.S.A., Missouri, Saint Louis)
MT: Universite de Montréal (Canada, Québec, Montreal)
MY: Universidad Central de Venezuela (Venezuela, Aragua, Maracay)
N: Nanjing University (People's Republic of China, Jiangsu, Nanjing)
NBG: South African National Biodiversity Institute (South Africa, Western Cape Province, Cape Town)
NCU: University of North Carolina at Chapel Hill (U.S.A., North Carolina, Chapel Hill)
NCY: Conservatoire et Jardins Botaniques de Nancy, Universite de Nancy I (France, Nancy)
NE: University of New England (Australia, New South Wales, Armidale)
NH: South African National Biodiversity Institute (South Africa, KwaZulu-Natal Province, Durban)
NHM: University of Nottingham (U.K., England, Nottingham)
NHMR: Natural History Museum Rijeka (Croatia, Rijeka)
NMNL: Natuurmuseum Nijmegen e.o. (Netherlands, Nijmegen)
NOU: Institut de Recherche pour le Development (IRD) (New Caledonia, Noumea)
NSW: Royal Botanic Gardens & Domain Trust (Australia, New South Wales, Sydney)
NT: Department of Environment, Parks and Water Security (Australia, Northern Territory, Alice Springs)
NY: The New York Botanical Garden (U.S.A., New York, Bronx)
ORT: Instituto Canario de Investigaciones Agrarias (ICIA) (Spain, Canary Islands, Puerto de la Cruz)
OS: Ohio State University (U.S.A., Ohio, Columbus)
P: Museum National d'Histoire Naturelle (France, Paris)
PERTH: Western Australian Herbarium (Australia, Western Australia, Perth)
PG: Plant Gateway (U.K., Surrey, Kingston-upon-Thames)
PH: Academy of Natural Sciences (U.S.A., Pennsylvania, Philadelphia)
PMA: Universidad de Panamá (Panama, Panamá, Panamá)
PRE: South African National Biodiversity Institute (South Africa, Gauteng Province, Pretoria)
PTBG: National Tropical Botanical Garden (U.S.A., Hawaii, Kalaheo)
QCA: Pontificia Universidad Catolica del Ecuador (Ecuador, Quito)
QRS: CSIRO (Australia, Queensland, Atherton)
RB: Jardim Botanico do Rio de Janeiro (Brazil, Rio de Janeiro, Rio de Janeiro)
REU: Universite de la Reunion (Reunion, Sainte-Clotilde)
SALA: Universidad de Salamanca (Spain, Salamanca)
SAR: Department of Forestry (Malaysia, Sarawak, Kuching)
SGO: Museo Nacional de Historia Natural (Chile, Santiago)
SI: Instituto de Botanica Darwinion (Argentina, Buenos Aires, San Isidro)
SING: Singapore Botanic Gardens (Singapore, Singapore, Singapore)
SP: Instituto de Botânica (Brazil, São Paulo, São Paulo)
SPF: Universidade de Sao Paulo (Brazil, São Paulo, São Paulo)
SPFR: Universidade de Sao Paulo (Brazil, São Paulo, Ribeirao Preto)
SUVA: University of the South Pacific (Fiji, Suva)
TAN: Parc Botanique et Zoologique de Tsimbazaza (PBZT) (Madagascar, Antananarivo)
TCD: Trinity College (Ireland, Dublin)
TEX: University of Texas at Austin (U.S.A., Texas, Austin)
TNS: National Museum of Nature and Science (Japan, Tsukuba)
TUM: Technische Universität München (Germany, Freising)
U: Naturalis (Netherlands, Leiden)
UAPC: University of Alberta (Canada, Alberta, Edmonton)
UB: Universidade de Brasília (Brazil, Distrito Federal, Brasília)
UEC: Universidade Estadual de Campinas (Brazil, Campinas)
UIS: Universidad Industrial de Santander (Colombia, Santander, Bucaramanga)
UPCB: Universidade Federal do Paraná (Brazil, Paraná, Curitiba)
UPR: Botanical Garden of the University of Puerto Rico (Puerto Rico, Puerto Rico, Río Piedras)
UPS: Museum of Evolution (Sweden, Uppsala)
UPTC: Universidad Pedagógica y Tecnológica de Colombia (Colombia, Boyacá, Tunja)
US: Smithsonian Institution (U.S.A., District of Columbia, Washington)
USM: Universidad Nacional Mayor de San Marcos (Peru, Lima)
WTU: University of Washington (U.S.A., Washington, Seattle)
YA: National Herbarium of Cameroon (Cameroon, Yaounde)
ZSS: Sukkulenten-Sammlung Zürich (Switzerland, Zürich)

Where no information is available, a column contains the text '-'.

Gene manifest
-------------

This (gene_manifest.txt) is a tab-delimited file, with columns as follows:

1. Gene ID
2. Exemplar gene name
3. Species from which the exemplar gene name has been taken
4. Database name (of the database from which the exemplar gene name was
   obtained)
5. Record ID (of the database record from which the exemplar gene name was
   obtained)
6. URL (to the online database record from which the exemplar gene name was
   obtained)
7. In tree? (values 'Y' or 'N'). Indicates whether or not this gene was used to
   build the species tree.

The databases from which exemplar gene names are taken are currently:

UniProtKB: The UniProt Knowledgebase (http://www.uniprot.org)

If no suitably named exemplar gene has been found, columns 2 to 6 contain the
text ‘-’.
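
As an illustration of the gene manifest layout described above, the following
is a minimal Python sketch for reading gene_manifest.txt. It is not part of the
release tooling. The names GeneRecord and read_gene_manifest are hypothetical,
the sample rows are invented for the example, and the sketch assumes that any
row whose seventh column is not 'Y' or 'N' (for example a header row, if the
file has one) can be skipped.

```python
import csv
import io
from typing import NamedTuple, Optional


class GeneRecord(NamedTuple):
    """One row of gene_manifest.txt (field names are illustrative)."""
    gene_id: str
    exemplar_name: Optional[str]
    exemplar_species: Optional[str]
    database: Optional[str]
    record_id: Optional[str]
    url: Optional[str]
    in_tree: bool


def _or_none(value):
    # The README states that '-' marks a column with no information.
    return None if value == "-" else value


def read_gene_manifest(handle):
    """Parse a tab-delimited gene manifest from an open text handle."""
    records = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Assumption: skip blank lines, malformed rows and any header row.
        if len(row) != 7 or row[6] not in ("Y", "N"):
            continue
        gene_id, name, species, database, record_id, url, in_tree = row
        records.append(GeneRecord(
            gene_id,
            _or_none(name),
            _or_none(species),
            _or_none(database),
            _or_none(record_id),
            _or_none(url),
            in_tree == "Y",
        ))
    return records


# Invented sample data, for illustration only.
sample = (
    "G0001\tABC1\tGenus_species\tUniProtKB\tX00000\thttp://example.org/X00000\tY\n"
    "G0002\t-\t-\t-\t-\t-\tN\n"
)
genes = read_gene_manifest(io.StringIO(sample))
used_in_tree = [g.gene_id for g in genes if g.in_tree]
print(used_in_tree)
```

To read a downloaded manifest, pass an open file instead of the sample, for
example open("gene_manifest.txt", encoding="utf-8"). The other manifests are
also tab-delimited, so the same approach should carry over with the column
lists given in their respective sections.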