File Size Management Tools

If your query files are too large for NaPDoS website submission, several options are available for pre-filtering and/or subdividing the data into smaller batches on your own computer.

  1. Assembly

    Short sequencing reads, such as those obtained using Illumina technology (typically 150 nucleotides or shorter), cannot be used for NaPDoS classification unless they are first assembled into longer contigs with a program such as SPAdes or IDBA-UD. Short query sequences do not provide enough information to determine whether or not they match PKS and NRPS reference domains, which are typically about 1250 nucleotides (425 amino acids) long. Query sequences should ideally cover at least half of that length.
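
    To check whether an assembly meets the length guideline above, the sketch below counts the sequences in a FASTA file and reports how many reach a minimum length. The function name `fasta_length_summary` is hypothetical and not part of NaPDoS. The default threshold of 625 nucleotides is an assumption, taken as half of the approximate 1250-nucleotide domain length.

```shell
# Hypothetical helper (not part of NaPDoS): summarize sequence lengths in a FASTA file.
# Usage: fasta_length_summary contigs.fasta [min_length]
fasta_length_summary() {
    fasta="$1"
    min_len="${2:-625}"   # assumed default: half of ~1250 nucleotides
    awk -v minlen="$min_len" '
        # Count the record just finished, if there was one.
        function tally() { if (have) { total++; if (len >= minlen) n_long++ } }
        /^>/ { tally(); have = 1; len = 0; next }           # header starts a new record
             { gsub(/[ \t\r]/, ""); len += length($0) }     # sequence line (may be wrapped)
        END  { tally(); printf "total=%d at_or_above_%d=%d\n", total, minlen, n_long }
    ' "$fasta"
}
```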

  2. Size filtering

    Assembled contigs often include sequences that are too short to allow detection of PKS and NRPS domains. The size_limit_seqs.pl perl script can be downloaded (by clicking the link) and run from the unix command line on your local computer to pre-filter a fasta file of contig sequences according to size. Example commands:

        gunzip size_limit_seqs.pl.gz
        chmod 755 size_limit_seqs.pl
        ./size_limit_seqs.pl fasta_filename   num_seqs_per_subfile  minimum_seq_length > out_filename
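
    If Perl is not available, a similar length filter can be approximated with awk. This is a minimal sketch, not the size_limit_seqs.pl implementation. The function name `filter_fasta_by_length` is hypothetical, and the output writes each retained sequence on a single unwrapped line.

```shell
# Hypothetical awk-based alternative (not the NaPDoS script): keep sequences
# whose length is at least min_length.
# Usage: filter_fasta_by_length contigs.fasta 625 > contigs_min625.fasta
filter_fasta_by_length() {
    fasta="$1"
    min_len="$2"
    awk -v minlen="$min_len" '
        # Print the buffered record if it is long enough.
        function emit() {
            if (header != "" && length(seq) >= minlen) print header "\n" seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }      # new record begins
             { gsub(/[ \t\r]/, ""); seq = seq $0 }        # join wrapped sequence lines
        END  { emit() }
    ' "$fasta"
}
```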
    		
  3. File splitting

    Files can be split into multiple parts by downloading the serialize_seqs.pl perl script (by clicking the link) and running it from the unix command line on your local computer. Example commands:

        gunzip serialize_seqs.pl.gz
        chmod 755 serialize_seqs.pl
        ./serialize_seqs.pl  fasta_filename  num_seqs_per_subfile
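
    The same kind of split can be sketched with awk alone. This is an illustration, not the serialize_seqs.pl implementation. The function name `split_fasta` and the output naming scheme (`prefix_001.fasta`, `prefix_002.fasta`, and so on) are assumptions made for this example.

```shell
# Hypothetical awk-based alternative (not the NaPDoS script): write every
# num_seqs_per_subfile sequences to a new numbered file.
# Usage: split_fasta contigs.fasta 1000 batch
split_fasta() {
    fasta="$1"
    per_file="$2"
    prefix="${3:-part}"   # assumed output prefix
    awk -v n="$per_file" -v prefix="$prefix" '
        /^>/ {
            # Start a new output file after every n headers.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s_%03d.fasta", prefix, ++part)
            }
            count++
        }
        out != "" { print > out }   # copy header and sequence lines unchanged
    ' "$fasta"
}
```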
    		
  4. De-replication

    Identical or nearly identical sequences can be consolidated using the CD-HIT program. This approach can be particularly helpful for PCR or transcriptome data sets with high levels of sequence replication.
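
    For a quick first pass, exact duplicates can be removed with awk before any clustering. This sketch is a simpler technique than CD-HIT: it removes only sequences that are identical (ignoring case) and does not consolidate nearly identical ones. The function name `dedupe_exact_fasta` is hypothetical.

```shell
# Hypothetical helper (not CD-HIT): keep the first occurrence of each
# distinct sequence, comparing sequences case-insensitively.
# Usage: dedupe_exact_fasta amplicons.fasta > amplicons_unique.fasta
dedupe_exact_fasta() {
    awk '
        # Print the buffered record only if its sequence has not been seen.
        function emit(   key) {
            if (header == "") return
            key = toupper(seq)
            if (!(key in seen)) { seen[key] = 1; print header "\n" seq }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
             { gsub(/[ \t\r]/, ""); seq = seq $0 }        # join wrapped sequence lines
        END  { emit() }
    ' "$1"
}
```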

  5. Amino acid translation

    Nucleic acid sequence files can be pre-translated into amino acid sequence files, which will often be smaller in total size, using programs such as Prokka.

  6. Preliminary BLASTX filtering

    Assembled contigs can be pre-selected based on a unix command-line Diamond BLASTX search against the downloadable domain reference sequences at the bottom of the NaPDoS Pathways page. Candidate sequences identified by the pre-filtering BLASTX step can then be selected on the unix command line with the perl script getseq_multiple.pl (download by clicking the link). Example commands:

       diamond makedb --in all_KS_191020_1877.faa -d all_KS_191020_1877.dmnd -t temp_directory 
       diamond blastx -d all_KS_191020_1877.dmnd -q orig_query_filename -e 1e-5 -p num_processors --max-target-seqs 1  -t temp_directory -o results_filename
       
       cut -f1 results_filename > candidate_query_ids
       gunzip getseq_multiple.pl.gz
       chmod 755 getseq_multiple.pl
       ./getseq_multiple.pl candidate_query_ids orig_query_filename > selected_candidate_sequences.fasta
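
    The final selection step can also be sketched with awk. This is an illustration, not the getseq_multiple.pl implementation. The function name `extract_fasta_by_id` is hypothetical, and the sketch assumes each ID in the list matches the first whitespace-delimited word of a FASTA header (without the leading `>`). Repeated IDs in the list are harmless.

```shell
# Hypothetical awk-based alternative (not the NaPDoS script): print the FASTA
# records whose ID appears in a list of IDs (one per line).
# Usage: extract_fasta_by_id candidate_query_ids orig_query_filename > selected.fasta
extract_fasta_by_id() {
    ids="$1"
    fasta="$2"
    awk '
        NR == FNR { if ($1 != "") wanted[$1] = 1; next }    # first file: load IDs
        /^>/      { id = substr($1, 2); keep = (id in wanted) }
        keep      { print }                                  # header and its sequence lines
    ' "$ids" "$fasta"
}
```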