Tutorial

Workflow

The NaPDoS bioinformatic pipeline is shown in the following diagram. The web interface to this pipeline is divided into five consecutive steps. Click on the link for each step to get detailed instructions.


Web Interface Steps

  1. Preliminary Candidate Screening

  2. HMM search (Genomic data only)

  3. BLAST search

  4. TREE construction

  5. Interpretation of Results

napdos_flowchart.png

Preliminary Candidate Screening

Basic Procedures

  1. Begin NaPDoS analysis by selecting a domain type (KS or C).

  2. Select a query type (protein, predicted coding sequence, PCR product, genomic, or metagenomic).

  3. Enter query sequence(s) in fasta format, either by pasting into the text box provided, or uploading a fasta format file.

    • Unless query type protein (amino acid) is specifically selected, sequences should be submitted in nucleic acid format.

    • Nucleic acid sequences are translated into predicted amino acids. Translations of genomic and metagenomic data include all 6 possible reading frames.

    • Sequences shorter than the minimum size required for reliable domain detection are discarded (minimum size for KS domains = 200 aa; C domains = 400 aa). These parameters can be adjusted using Advanced Settings.

    • Uploads are currently limited to less than 30 MB and fewer than 50,000 individual sequences.

    • Please contact us if you would like to analyze larger data sets.

  4. Clicking on the "seek" button brings up a new window showing estimated processing time and search parameters to be used.

    • Analysis does not actually begin until the "submit job" button on this new window is clicked.
tutorial1.png
tutorial_submit_job.png
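
The translation and length-filtering behaviour described in step 3 can be sketched as follows. This is a minimal illustration, not the NaPDoS implementation: the function names are hypothetical, the standard genetic code is assumed, and splitting each frame at stop codons before applying the length cutoff is an assumption about how fragments are delimited. Only the 200 aa and 400 aa defaults come from the text above.

```python
"""Sketch: six-frame translation followed by a minimum-length filter."""

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built codon by codon in TCAG order.
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Default minimum lengths quoted in the tutorial (amino acids).
MIN_LENGTH_AA = {"KS": 200, "C": 400}


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate one frame; codons not in the table (e.g. with N) become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> list[str]:
    """Three forward frames, then three reverse-complement frames."""
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


def candidate_fragments(dna: str, domain: str) -> list[str]:
    """Stop-free stretches at least as long as the domain minimum (assumed rule)."""
    minimum = MIN_LENGTH_AA[domain]
    return [
        piece
        for frame in six_frame_translations(dna)
        for piece in frame.split("*")
        if len(piece) >= minimum
    ]


print(six_frame_translations("ATGGCCTAA"))
# ['MA*', 'WP', 'GL', 'LGH', '*A', 'RP']
```

Because C domains have a higher default cutoff than KS domains, the same nucleotide input can yield candidates for one domain type and none for the other.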

Advanced Settings

Default parameter settings are recommended for routine use. However, in some cases, users may wish to boost sensitivity by choosing less stringent HMM or BLAST criteria, or shorter minimum sequence lengths. Users should be aware that these adjustments may increase false positive predictions. Conversely, selectivity can be improved by using lower e-values and longer minimum match lengths, at the cost of decreased sensitivity.
  • HMM settings apply to genomic or metagenomic sequence sets only.

  • Default C domain HMM parameters are based on recommended trusted cutoff values for PFAM Condensation domain model PF00668.12 [1].

  • Default KS domain HMM parameters are based on the Ketosynthase domain model developed by Yadav et al. [2].

  • Recommended values for minimum protein fragment length and BLASTP e-values were established empirically using manually curated reference database examples.

tutorial2.png
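
The trade-off between sensitivity and selectivity can be illustrated with a small filter over search hits. This is a sketch only: the `Hit` record, the `filter_hits` function, and every numeric value below are hypothetical and are not NaPDoS defaults.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_id: str
    evalue: float
    match_length: int  # aligned length in amino acids


def filter_hits(hits, max_evalue, min_length):
    """Keep hits at or below the e-value cutoff and at or above the length cutoff."""
    return [h for h in hits if h.evalue <= max_evalue and h.match_length >= min_length]


# Hypothetical hits, for illustration only.
hits = [
    Hit("cand_1", 1e-60, 410),
    Hit("cand_2", 1e-12, 230),
    Hit("cand_3", 1e-4, 150),
]

# Less stringent settings: more candidates, more potential false positives.
permissive = filter_hits(hits, max_evalue=1e-3, min_length=100)

# More stringent settings: fewer candidates, lower sensitivity.
strict = filter_hits(hits, max_evalue=1e-20, min_length=200)

print(len(permissive), len(strict))
# 3 1
```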

HMM search

For genomic sequences only, preliminary domain candidate information based on Hidden Markov Model (HMM) search is displayed on a separate page. Users may find this information helpful in estimating the total number and positions of PKS/NRPS operons present in a genomic or metagenomic sequence set. However, these initial results should be interpreted with some caution, for the following reasons:
  • For incomplete draft genomes or metagenomes, some candidates detected at this stage may represent partial or overlapping gene fragments or duplications.

  • Some candidates identified only by HMM search may encode protein fragments with PKS/NRPS-related functionality (e.g. fatty acid synthases) that do not actually produce compounds traditionally classified as natural products.

  • More stringent search methods are applied in later stages of the NaPDoS analysis pipeline to help resolve these issues.

tutorial_genome_search.png
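
When counting candidates from a draft genome or metagenome, it can help to check whether several of them occupy overlapping coordinates on the same contig. The sketch below groups such candidates so they are not counted twice. The `(contig, start, end)` tuple layout and the function name are assumptions for illustration; they do not describe the format of the NaPDoS results page.

```python
def group_overlapping(candidates):
    """Merge (contig, start, end) candidates that overlap on the same contig."""
    groups = []
    for contig, start, end in sorted(candidates):
        last = groups[-1] if groups else None
        if last and last["contig"] == contig and start <= last["end"]:
            # Overlaps the previous group: extend it instead of starting a new one.
            last["end"] = max(last["end"], end)
            last["members"] += 1
        else:
            groups.append({"contig": contig, "start": start, "end": end, "members": 1})
    return groups


# Hypothetical coordinates, for illustration only.
candidates = [
    ("contig_1", 100, 900),
    ("contig_1", 850, 1600),
    ("contig_1", 5000, 5800),
    ("contig_2", 100, 900),
]

groups = group_overlapping(candidates)
print(len(candidates), "candidates in", len(groups), "distinct regions")
# 4 candidates in 3 distinct regions
```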

BLAST search

A BLAST search is performed against curated reference database examples to identify matches to known PKS/NRPS pathways. Some suggested guidelines for interpreting BLAST scores are presented below. To proceed with further analysis, one or more candidate sequences must be selected using check boxes. Three different output options are available:
  • Output selected sequences provides trimmed candidate sequences in fasta format. These sequences can then be used to perform BLAST searches against the NCBI database (highly recommended), in case similar domains have not yet been added to NaPDoS.

  • Output alignment displays MUSCLE [3] results for selected candidates and their BLAST matches in MSF format. The MSF file can be downloaded for additional offline analysis, for example to make manual adjustments to the alignment, or to create custom trees using alternative programs.

  • Construct tree progresses to the next stage of the NaPDoS analysis pipeline, inserting candidate sequences plus their BLAST matches into a manually curated alignment of previously characterized database sequences.

tutorial_domain_search1.png

In some cases, the number of candidate matches on this page may be fewer than the number reported on the earlier genomic summary page, reflecting differences between HMM and BLAST stringencies used for the analysis.

Tree Construction

Selected candidate sequences plus their BLAST matches are trimmed and inserted into a manually curated reference alignment, keeping the original reference alignment intact. This alignment is used to build a tree, which is often more useful than BLAST results alone in predicting whether pathway products for candidate domains are likely to be similar to or different from previously known examples [4].

Tree output options

  • Phylogenetic domain trees are built using FastTree, which estimates approximately-maximum-likelihood trees [5].

  • Tree building does not actually begin until the "submit job" button is clicked.

  • After the tree is built, users choose either Newick (plain text) format or an SVG graphics image as an output format.

  • User sequences are highlighted in the SVG graphics image format with red dots, as shown in the example below.

  • FastTree output does not include bootstrap values. However, the program does provide confidence values, which are included in the Newick format output. These values can be viewed by opening the Newick file in most stand-alone graphical tree viewers, for example the open source software FigTree.


tutorial_svg_tree.png
tutorial_tree_choice.png



Newick format output

(hctox1_C2_dual:1.21247,(hctox5_C3_dual:1.58329,(hctox1_C3_dual:0.94480,hctox4_C3_dual:1.08446)0.842:0.19209)0.855:0.17790,(cyclo1_C12_dual:1.37115,((NC_013790.1_3_5_1279_1556:0.76822,surfa4_C3_LCL:0.60447)1.000:1.11061,(syrin1_C2_dual:0.42670,(syrin1_C8_dual:0.45307,(syrin1_C4_dual:0.04019,syrin1_C3_dual:0.04876)1.000:0.37152)0.909:0.19372)0.995:0.59611)0.884:0.23364)0.761:0.11321);
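
In this format, each closing parenthesis is followed by the confidence value for that internal node and then, after a colon, its branch length. For a quick look without a tree viewer, the leaf names and confidence values can be pulled out with regular expressions, as in the sketch below. It uses a subtree excerpted from the example above, the function names are hypothetical, and the approach is only a shortcut for simple unquoted Newick strings, not a full Newick parser.

```python
import re

# Subtree excerpted from the example Newick output above.
NEWICK = (
    "(syrin1_C8_dual:0.45307,(syrin1_C4_dual:0.04019,"
    "syrin1_C3_dual:0.04876)1.000:0.37152)0.909:0.19372;"
)


def leaf_names(newick: str) -> list[str]:
    """Labels that directly follow '(' or ',' are leaves."""
    compact = re.sub(r"\s+", "", newick)
    return re.findall(r"[(,]([^():,;]+):", compact)


def support_values(newick: str) -> list[float]:
    """Numbers that directly follow ')' are internal-node confidence values."""
    compact = re.sub(r"\s+", "", newick)
    return [float(value) for value in re.findall(r"\)([0-9.]+):", compact)]


print(leaf_names(NEWICK))
# ['syrin1_C8_dual', 'syrin1_C4_dual', 'syrin1_C3_dual']
print(support_values(NEWICK))
# [1.0, 0.909]
```

Stripping whitespace first also makes the helpers tolerant of Newick text that has been wrapped across lines.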
SVG format output

tutorial_svg_tree.png

Interpreting Results

BLAST hits for KS or C domains with more than 85-90% identity at the amino acid level indicate that the query domains may be associated with production of the same compound as the reference pathway, or a similar one. If you detect domains with less than 80% identity to any characterized domain in the NaPDoS database, a BLAST search against the NCBI nr database is recommended, because the NaPDoS database does not contain all characterized biosynthetic pathways, although we will update it regularly. If this search also finds no known domain with more than 85% identity, the biosynthetic gene cluster has most likely not yet been characterized, and the encoded compound may be new.
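
These guidelines can be written down as a simple triage rule. The sketch below is illustrative only: the function name and the wording of each outcome are hypothetical, and the treatment of the 80-85% range, which the guidelines do not address explicitly, is an assumption.

```python
def triage(percent_identity: float) -> str:
    """Map amino acid percent identity of the best BLAST hit to a suggested next step."""
    if percent_identity > 85.0:
        return "may produce the same or a similar compound as the reference pathway"
    if percent_identity < 80.0:
        return "run a BLAST search against NCBI nr; pathway may be uncharacterized"
    # Assumption: values between the two stated thresholds are treated as borderline.
    return "borderline; check tree placement and consider an NCBI nr search"


for identity in (92.0, 83.0, 72.5):
    print(identity, "->", triage(identity))
```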

Constructing a phylogenetic tree classifies the domains in a way that the best BLAST hits alone may not reveal. This classification can help predict the type of compound produced. The domain classes are defined by the clades observed in the reference trees [6].

References

  1. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K et al: The Pfam protein families database. Nucleic Acids Res 2010, 38(Database issue):D211-222.

  2. Yadav G, Gokhale RS, Mohanty D: Towards prediction of metabolic products of polyketide synthases: an in silico analysis. PLoS Comput Biol 2009, 5(4):e1000351.

  3. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004, 32(5):1792-1797.

  4. Jenke-Kodama H, Sandmann A, Muller R, Dittmann E: Evolutionary implications of bacterial polyketide synthases. Mol Biol Evol 2005, 22(10):2027-2039.

  5. Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 2003, 52(5):696-704.

  6. Ziemert N, Podell S, Penn K, Badger JH, Allen E, Jensen PR: The Natural Product Domain Seeker NaPDoS: A Phylogeny Based Bioinformatic Tool to Classify Secondary Metabolite Gene Diversity. PLoS One 2012, 7(3):e34064. PubMed PMID: 22479523.