This document describes the VM release of PhyloPythiaS+ (PPS+), version 1.4 beta.
Contact: PhyloPythiaS@uni-duesseldorf.de

This software is distributed as is, without any warranty. Please tell us about any errors you encounter, problems with the installation, or suggestions on how to improve the release. The PhyloPythiaS+ release contains several third-party components that may cause problems that we cannot fix ourselves promptly.

METHOD OVERVIEW

PhyloPythiaS+ (PPS+) is software for the automated and accurate taxonomic binning of a metagenome sequence sample into low-ranking taxonomic bins. It combines Bayesian marker gene classification with the composition-based taxonomic binning method PhyloPythiaS. The software works in two steps. In the first step, the input FASTA file of DNA sequences is processed by the marker gene analysis, which returns a list of taxa identified as present in the sample, together with the sample-derived sequences assigned to these taxa. In the second step, the composition-based method PhyloPythiaS builds models in the training phase, based on the list of taxa, the sample-specific data, and reference data from public databases. These models are then used to assign taxonomic identifiers to the sample sequences.

VM SETUP - how to set up the virtual machine on your computer

The VM runs on any 64-bit operating system on which the 64-bit Oracle VirtualBox can be installed. It requires at least 4 GB RAM (2 GB for the VM; 8 GB recommended) and 50-80 GB disk space (80-110 GB recommended, depending on the reference and sample used). We also strongly recommend SSD storage. The VM was tested on a laptop computer under Windows 7 64-bit and Ubuntu 12.04 64-bit (4 GB RAM, Intel i5 M520 2.4 GHz, standard 7200 rpm laptop HDD), and under OS X 10.9 (Intel i5 2557M 1.7 GHz, 4 GB RAM, SSD storage).
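On a Linux host, you can roughly check whether your machine meets these minimums before starting the setup. This is a sketch only (it assumes GNU coreutils and is not part of the PPS+ package); the thresholds mirror the minimum requirements stated above.

```shell
# Rough host check against the stated minimums: 4 GB RAM, 50 GB free disk.
# (Illustrative sketch for a Linux host with GNU coreutils.)
host_check() {
    # MemTotal in /proc/meminfo is reported in kB; convert to whole GB
    ram_gb=$(awk '/MemTotal/ {printf "%d", $2/1024/1024}' /proc/meminfo)
    # free space (in GB) on the file system holding the current directory
    free_gb=$(df -BG --output=avail . | tail -1 | tr -dc '0-9')
    [ "$ram_gb" -ge 4 ]   && echo "RAM OK (${ram_gb} GB)"        || echo "RAM below 4 GB"
    [ "$free_gb" -ge 50 ] && echo "disk OK (${free_gb} GB free)" || echo "less than 50 GB free"
}
host_check
```

Run it from the directory where you plan to store the VM image and shared folder, since the disk check applies to the current directory's file system.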
To set up the VM, follow these steps:

1) Install the 64-bit VirtualBox; choose the appropriate installation file on this web site:
http://www.oracle.com/technetwork/server-storage/virtualbox/downloads/index.html

2) Download the image of the virtual machine from:
http://algbio.cs.uni-duesseldorf.de/software/ppsp/1_4/ppsp_1_4_vm_64bit.ova
(You can verify that the downloaded file is not corrupted by computing the md5 checksum of the file, which must be: "d52377a9b8f4ef4291aaa9718d614936".)

3) Import the VM image into the Oracle VirtualBox (usually, double-click the "ppsp_1_4_vm_64bit.ova" file to import the VM). After the import finishes, the "Oracle VM VirtualBox Manager" displays the imported VM with the name "hhu_vm_ppsp_64bit" in the list of all virtual machines. Click on "hhu_vm_ppsp_64bit"; the settings of the virtual machine will be displayed on the right-hand side. Here, you can modify the properties of the VM. Under "System", you can increase the "Base Memory", which is the amount of main memory given to the VM; e.g. you can give 50% of the physical main memory to the VM. Under "Shared folders", edit the shared folder so that the "Folder Path" is the path to the directory that will later contain all "reference" data, and set the "Folder Name" to "host_shared". Note that the shared folder may not support all Ubuntu file system operations, e.g. creating symbolic links on Windows host file systems. The disk that contains this shared folder should have at least 25-55 GB free space (depending on the reference used).

4) Start the VM (click on hhu_vm_ppsp_64bit and then on Start).

5) Open the Terminal window and press enter. (Note that the password for "user1" is an empty string.)
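The checksum in step 2 can be compared against the downloaded image with a small helper like the one below (a sketch for Linux hosts, where md5sum is available; see the FAQ for OS X and Windows equivalents). The function name verify_md5 is used here for illustration only.

```shell
# Compare a file's md5 digest against an expected value (Linux; uses md5sum).
verify_md5() {
    # $1 = file name, $2 = expected md5 hex digest
    actual=$(md5sum "$1" | awk '{print $1}')
    if [ "$actual" = "$2" ]; then
        echo "checksum OK"
    else
        echo "checksum MISMATCH - download the file again"
    fi
}

# e.g.:
# verify_md5 ppsp_1_4_vm_64bit.ova d52377a9b8f4ef4291aaa9718d614936
```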
6) Install the latest version of the system tools; type and press enter:
update -s

7) Install the latest version of the PPS+ package; type and press enter:
update -t

8) Install the reference data; depending on which version of the reference you want to use, type one (or both) of the commands and press enter:
update -r NCBI20121122
update -r NCBI20140513
(The update command takes the name of a reference dataset and installs it. After the script finishes, the reference data can be used. Note that this step is time-consuming and can take up to several hours depending on your hardware and internet connection; it downloads and decompresses a file of several GB.)

9) Optional: To verify that the VM is set up correctly, you can run the following test, which runs the whole PPS+ pipeline with a small dataset. According to the reference data installed, type one of the commands and press enter:
time ppsp -c /apps/pps/tools/config_ppsp_vm_refNCBI20121122_example.cfg -n -g -o s16 mg -t -p c -r -s
or
time ppsp -c /apps/pps/tools/config_ppsp_vm_refNCBI20140513_example.cfg -n -g -o s16 mg -t -p c -r -s
(The results will be in /apps/pps/tests/test01/output or /apps/pps/tests/test02/output, respectively; this test will take approx. 2 hours, depending on your hardware.)

10) The tutorial and LICENSE for PPS+ can be found in:
/apps/pps/tools/LICENSE.txt
/apps/pps/tools/ppsp_vm_tutorial.pdf

VM DIRECTORY STRUCTURE

/apps/pps/tools - contains all the tools and scripts required by PPS+
/apps/pps/samples - usually contains all the samples (FASTA files) to be analysed
/apps/pps/samples/test_3strains - contains a sample FASTA file (*.fna) with simulated contigs and the corresponding labels (*.tax)
/apps/pps/tests - store all the analysis results here
/apps/pps/tools/config_ppsp_vm_refNCBI20121122_example.cfg - sample configuration file when using reference NCBI20121122
/apps/pps/tools/config_ppsp_vm_refNCBI20140513_example.cfg - sample configuration file when using reference NCBI20140513
/mnt/host_shared - the mounted shared folder that contains the reference data; you can exchange files with the host operating system using this folder

SAMPLE ANALYSIS CONFIGURATION - how to prepare/configure one sample analysis using PPS+

A) Create an empty directory in the /apps/pps/tests folder that will contain all the analysis results and temporary files (e.g. test01 and test02 are used for the example dataset).

B) Create a configuration file, e.g. copy and modify file /apps/pps/tools/config_ppsp_vm_refNCBI20121122_example.cfg or file /apps/pps/tools/config_ppsp_vm_refNCBI20140513_example.cfg, depending on the reference you want to use. Note that all the paths in the configuration file must be absolute paths. First, set the "pipelineDir" to the directory created in step A. The section "INPUT FILES" contains all the input files; only the entry "inputFastaFile" is mandatory, it contains the sequences you want to classify. However, it is recommended to specify all the input files if available. The section "REFERENCE" contains paths to the reference data; you can use these settings for all the samples you classify. The reference data is stored in the shared folder.
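Steps A and B can be sketched from the command line as follows. This is illustrative only: the directory name "my_sample" is arbitrary, and the sed pattern assumes the configuration file uses "key=value" lines, so check it against your copy of the example file.

```shell
# Create an analysis directory and derive a configuration file from the
# shipped example (directory name and the key=value assumption are illustrative).
prepare_analysis() {
    # $1 = new analysis directory (absolute path), $2 = example configuration file
    mkdir -p "$1"
    cp "$2" "$1/config.cfg"
    # point pipelineDir at the new directory; all paths must stay absolute
    sed -i "s|^pipelineDir=.*|pipelineDir=$1|" "$1/config.cfg"
}

# e.g.:
# prepare_analysis /apps/pps/tests/my_sample \
#     /apps/pps/tools/config_ppsp_vm_refNCBI20121122_example.cfg
```

After this, open the new config.cfg and fill in the remaining entries (inputFastaFile and, if available, the other "INPUT FILES" entries) by hand.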
The section "TOOLS" contains paths to the tools and scripts required by PPS+; you don't have to change these paths. In the "BASIC SETTINGS" section, you can set down to which taxonomic rank the sequences are assigned and the maximum number of clades to model. Note that it is recommended to consider only sequences of at least 1000 bp length, and it is not recommended to change anything in the "ADVANCED SETTINGS".

RUN PPS+ SIMPLE - a list of basic commands that can be run from a command line

# to get all the options of the master script, type:
ppsp -h

# run the whole pipeline employing the marker gene analysis (if scaffolds are available, instead of "-p c" type "-p c s v")
ppsp -c CONFIGURATION_FILE -n -g -o s16 mg -t -p c -r -s

# get predictions based only on the marker gene analysis (it doesn't run PhyloPythiaS); check the configuration parameter "minSeqLen"!
ppsp -c CONFIGURATION_FILE -n -g -o s16 mg -s

# run the whole pipeline using the general model for the "maxLeafClades" most abundant clades, at a given taxonomic rank, in the reference (if scaffolds are available, instead of "-p c" type "-p c s v")
ppsp -c CONFIGURATION_FILE -o general -t -p c -r -s

RUN PPS+ HARDER - modify the list of clades or use expert sample-specific data to train PhyloPythiaS (note that this option hasn't been tested thoroughly)

# run the marker gene analysis and prepare the output for PhyloPythiaS
ppsp -c CONFIGURATION_FILE -n -g -o s16 mg

# Now, file pipelineDir/working/ncbids.txt (where "pipelineDir" is the directory set in the configuration file) contains all the leaf-level clades to be modelled by PhyloPythiaS. You can delete some of the clades or add new ones, but please make sure that the file contains only leaf-level clades.

# You can also change which sample-derived data will be used to build the models. The directory pipelineDir/working/sampleSpecificDir contains all the sample-derived data found via the marker gene analysis.
Each FASTA file contains sample-specific data for the NCBI taxon id denoted by the prefix of the file name (e.g. a file named "126.1.fna" contains sample-derived data for NCBI taxon id 126). If you want to add more (expert) sample-derived data, create a FASTA file with an appropriate name in that directory. It is possible to have more than one FASTA file containing reference sequences for one NCBI taxon id; you can name such files: 126.1.fna, 126.2.fna, 126.3.fna, etc.

# Note that the list of clades (ncbids.txt) and the content of the sample-derived directory (sampleSpecificDir) must be consistent. (Note that, usually, it is enough to have at least 100 kb of sample-derived data to model a clade.)

# Run the rest of the pipeline with the modified list of clades or sample-derived data (if scaffolds are available, instead of "-p c" type "-p c s v"):
ppsp -c CONFIGURATION_FILE -t -p c -r -s

OUTPUT FILES

pipelineDir/output - contains all important output files
pipelineDir/output/inputFastaFile.fna.pOUT - tab-separated file containing predicted sequences; first column .. sequence name, second column .. NCBI taxon id
pipelineDir/output/inputFastaFile.fna.PP.pOUT - prediction file in the PhyloPythia format
pipelineDir/output/summary_train.txt - a list of clades generated by the marker gene analysis that contains all clades that will be modelled; 1st column .. sample-specific data in bp, 2nd column .. sample-specific data in number of sequences, 3rd column .. scientific name (NCBI taxon id) of a clade at rank superkingdom, 4th column .. corresponding phylum rank, etc.
(Note that this file is a result of the marker gene analysis and doesn't reflect any changes that you made manually to the list of clades or the sample-derived data.)
pipelineDir/output/summary_all.txt - a list of all clades for which some sample-specific data was found (not all clades from this list are modelled)
pipelineDir/output/inputFastaFile.fna.cons - the scaffold-contig consistency file (the consistency is computed based on the scaffoldsToContigsMapFile from the configuration file)
pipelineDir/output/precision_recall.csv - precision and recall computed at different taxonomic ranks, considering referencePlacementFileOut set in the configuration file as the true assignments of the sequences (see also entries recallMinFracClade and precisionMinFracPred in the configuration file)
pipelineDir/output/precision_recall_no_ssd.csv - precision and recall computed based only on the input sequences different from the sample-specific data
pipelineDir/output/no_ssd.fas - contains sequences from the inputFastaFile without the sample-specific data
pipelineDir/output/cmp_ref - contains comparison tables for different taxonomic ranks, where rows correspond to the true assignments (referencePlacementFileOut from the configuration file) and columns correspond to the predicted assignments (ideally, all the data lie on the diagonal)
pipelineDir/output/train_accuracy - contains precision and recall (as well as comparison tables) for different training data types, where the training data are considered the true assignments (this data is available after you run the pipeline after training using option "-a")
pipelineDir/output/log - contains log files for different subroutines (e.g. PPS train, PPS predict)
pipelineDir/output/contigs_vs_scaff - comparison of the contig and scaffold predictions (if run with "-p c s v" and "inputFastaScaffoldsFile" and "scaffoldsToContigsMapFile" were set in the configuration file); predictions of the scaffolds correspond to the rows and predictions of the contigs to the columns (if the models are good, the data should lie on the diagonal)

WORKING FILES

pipelineDir/working - contains working/temporary files
pipelineDir/working/projectDir - the PhyloPythiaS (PPS) working directory
pipelineDir/working/projectDir/models - after the PPS training phase finishes, the models are stored in this directory and can be reused
pipelineDir/working/projectDir/sampled_fasta - FASTA files used to train PPS (this directory can be removed after the PPS training phase finishes or after the training accuracy is computed)
pipelineDir/working/projectDir/train_data - feature vectors generated from the "sampled_fasta" files used to train PPS (this directory can be removed after the PPS training phase finishes)
pipelineDir/working/ncbids.txt - list of clades that is used to train PPS (build models)
pipelineDir/working/sampleSpecificDir - contains sample-specific data for PPS
pipelineDir/working/PPS_config_generated.txt - generated PPS configuration file
pipelineDir/working/train_accuracy - contains temporary files used to compute the training accuracy (this directory can be removed after the training accuracy is computed)
pipelineDir/working/crossVal - contains temporary files used to compare predictions of scaffolds vs predictions of contigs
*.sl - large files containing temporary data (feature vectors) that can be removed after the whole pipeline finishes
pipelineDir/working/*.ids - input FASTA files where the sequence names were replaced by working (temporary) sequence ids; the sequence id pattern "[0-9]+_[0-9]+" corresponds to "scaffoldID_contigID", where the mapping (contigName -> contigID) is stored in file "*.cToIds", the scaffold mapping (scaffoldName -> scaffoldID) is stored in file "*.sToIds", and the mapping (scaffoldId -> scaffoldId_contigId) is stored in file "*.mapSCIds"
pipelineDir/working/*.ids.out - tab-separated prediction file generated by PPS; the first column contains sequence ids (corresponding to the ids in file "*.ids"), the last column contains NCBI taxon ids
pipelineDir/working/*.ids.PP.out - prediction file in the PhyloPythia format generated by PPS (the sequence ids correspond to the ids in file "*.ids")
pipelineDir/working/*.ids.ssd_cross - shows how the sample-specific data (found by the marker gene analysis) were predicted for different clades (all computations are in terms of the number of sequences, not bp)

FREQUENTLY ASKED QUESTIONS

Q: Command "update" returns an error message.
A: It is very likely that you are having network problems. Try to run the command again after you resolve the network issue. Alternatively, you can download the sources manually.
For the reference data, download:
"http://algbio.cs.uni-duesseldorf.de/software/ppsp/1_4/reference_NCBI20121122.tar.xz" or
"http://algbio.cs.uni-duesseldorf.de/software/ppsp/1_4/reference_NCBI20140513.tar.xz"
Copy it to the shared directory "host_shared" and decompress it there (e.g. under Linux systems using command: tar -xJf REFERENCE.tar.xz), so that the following paths to the reference resources exist afterwards:
host_shared/REFERENCE/mg3
host_shared/REFERENCE/sequences
host_shared/REFERENCE/silva111
host_shared/REFERENCE/taxonomy
For the tools, download:
"http://algbio.cs.uni-duesseldorf.de/software/ppsp/1_4/tools.tar.xz"
Copy it to folder: /apps/pps
Decompress it using command: tar -xJf tools.tar.xz

Q: How can I compute the md5 checksum of a file?
A: Linux: md5sum fileName
OS X: md5 fileName
Windows: e.g. use this program: www.winmd5.com

Q: When importing the VM, I am getting: "VirtualBox - Error, VT-x disabled in the BIOS".
A: Enable the Intel(R) Virtualization Technology in the BIOS.

Q: The marker gene analysis didn't find all clades that were expected in a sample.
A: You can try to lower the bootstrap cutoff of the Naive Bayes classifier, which is 80% by default (see configuration parameter mothurClassifyParamOther; change "cutoff=80" e.g. to "cutoff=70" or "cutoff=60"). Note that lowering the cutoff can increase the number of false assignments.

Q: What's the maximum length of a DNA sequence in an input file?
A: The sequence should be shorter than 1 Mbp.

CHANGELOG
2014/06/19 - version 1.4 released (in the future, check: https://github.com/algbioi/ppsp/wiki)
2014/07/21 - FAQ updated