I submitted a SeqMule job to SGE, it aborted without any error, what's wrong here?

One possible cause is that SeqMule exceeded the requested memory limit. How to troubleshoot? First, run qacct -j JOBID to identify job exit status. Make sure to replace JOBID with your actual job ID assigned by SGE. Then you will see something like this(assume accounting_summary to set to TRUE):

$ qacct -j 584552
==============================================================
qname        all.q
hostname     compute-0-3.local
group        wanglab
owner        user
project      NONE
department   defaultdepartment
jobname      STDIN
jobnumber    584552
taskid       undefined
account      sge
priority     0
qsub_time    Sat Jan 30 14:53:20 2016
start_time   Sat Jan 30 14:53:31 2016
end_time     Sat Jan 30 21:11:33 2016
granted_pe   smp
slots        12
failed       100 : assumedly after job
exit_status  137 #137-128=9, which is SIGKILL
ru_wallclock 22682
ru_utime     0.080
ru_stime     0.037
ru_maxrss    2760
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    16265
ru_majflt    0
ru_nswap     0
ru_inblock   16
ru_oublock   32
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     337
ru_nivcsw    62
cpu          66629.510
mem          338134.237
io           1547.102
iow          0.000
maxvmem      52.969G
arid         undefined

The line exit_status indicates that exit code is 137 - 128 = 9 which means SIGKILL. The line maxvmem tells us maximum memory used is 52.969G, if we only asked for 4G for each CPU, we were supposed to use only a total of 4G * 12 = 48G memory. This explains why SGE killed our job without letting SeqMule complain. Under such circumstances, we can request more memory by adjusting -l h_vmem option in qsub command, for example we can change -l h_vmem=4G to -l h_vmem=6G. Alternatively, we can decrease SeqMule memory usage by changing -jmem option in seqmule pipeline command, for example, we can change -jmem 1750m (the default) to -jmem 1024m. WARNING: changing -jmem only works if GATK is using too much memory and may have a side effect of insufficient memory for GATK.

Is there a log file showing the runtime error?

SeqMule only saves runtime parameters to *.log file. If you want to check the runtime output after running SeqMule in the background (or submitted to a cluster), please use nohup your_seqmule_command > output.txt &. nohup can run your command even after you log out. All messages that were printed on screen will be saved in output.txt file. Alternatively, you can append 2>stderr.txt to your SeqMule command. The STDERR message (all error messages) will be saved in output.txt and stderr.txt, respectively.

How to debug a runtime error?

When SeqMule aborts due to a runtime error, you will see a message like this:

----------ERROR----------
[ => SeqMule Execution Status: step 63 FAILED at Mon Apr  4 13:51:12 PDT 2016, Merge split VCF]
ERROR: command failed
/home/yunfeiguo/projects/seqmule_working/bin/secondary/../../bin/secondary/worker /home/yunfeiguo/projects/variant_call/20160404/seqmule.04042016.5207.logs 63 "/home/yunfeiguo/projects/seqmule_working/bin/secondary/../../bin/seqmule stats -tmpdir /tmp -ref /home/yunfeiguo/projects/seqmule_working/database/human_g1k_v37.fasta -jmem 1750m -jexe java  -t 2 -u-vcf ..."
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
After fixing the problem, please execute 'cd /home/yunfeiguo/projects/variant_call/20160404' and 'seqmule run myAnalysisName.script' to resume analysis.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

This message does not tell you the specific reason of error. To find out the exact error, one has to refer to the error message of the failing program. In the above example, the failed program is step 63. All runtime information is stored in /home/yunfeiguo/projects/variant_call/20160404/seqmule.04042016.5207.logs. Therefore its messages can be found in /home/yunfeiguo/projects/variant_call/20160404/seqmule.04042016.5207.logs/63. For example:

$ ls /home/yunfeiguo/projects/variant_call/20160404/seqmule.04042016.5207.logs/63
JOBID.625242  MSG  PID.10739  STATUS.error  stderr  stdout

The JOBID.625242 shows the job ID assigned by SGE. If you are not running SeqMule via SGE, don't worry about it. If you are running SeqMule via SGE, you can run qacct -j 625242 (substitute 625242 by the real job ID) to check whether SGE has anything to say about this execution. Watch out for memory usage, time limit. If this program used more resources than requested, SGE will kill it silently. MSG is a human-readable message telling us what step 63 is. Here MSG says, 'Merge split VCF'. PID.10739 is the process ID of the program. STATUS.error tells SeqMule that this program exited with an error. stderr contains STDERR output from the program, and usually that can help you understand why the program failed. stdout contains STDOUT output. STDOUT can tell you some important things like the progress of execution, but most of time, it is less helpful for debugging.

Why did I get set: Variable name must begin with a letter. error when I tried to run the analysis script by qsub?

SeqMule added set -e at the begining of script to make it exit at first error. If your job scheduling system (e.g. SGE) tries to execute the script by csh or tcsh, it will return the above error. Please use bash at qsub instead. For example, under SGE, you can use:

echo 'your_seqmule_command' | qsub -V -cwd -S /bin/bash -N sample

How to solve GATK ERROR MESSAGE: Unable to parse header with error: /tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub116224981.tmp (Too many open files), for input source: /tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub116224981.tmp?

This happens because Linux limits the number of files opened. The easiest way is to use -gatknt option in seqmule to specify a lower number of CPUs (eg 1 or 2) used for GATK genotyping. Another option is to request your system administer to increase the limit.

Does SeqMule support merging of BAM files?

Yes. Run seqmule pipeline -m with other arguments. In this case, all BAM files will be merged and therefore only one sample prefix is needed.

What if GATK complains for insufficient memory?

Please use -jmem option to give GATK more memory. Note, however, you have to request more memory than specified by -jmem if you run your analysis on a computation cluster as there is memory overhead for java virtual machine.

How does the QUICK mode work?

Under QUICK mode, total region to be called will be split into N parts. N is the number of threads. Variant calling is then performed on each part simultaneously. Afterwards, the resulting VCF will be merged. For GATK, variant filtering is not done until after merging because VQSR filtering requires a lot of variants to calculate some statistics.

Why are the numbers on Venn diagram different from the statistics I got from the consensus VCF file?

Venn Diagram is plotted this way: ANNOVAR is used to convert VCF to AVINPUT; then for each variant, chromosome+start+end+reference allele+observed allele is used as an ID (alleles are not used for indels) to determine whether two variants overlap. Statistics for consensus result is calculated this way: GATK CombineVariants is used to extract consensus calls from multiple VCFs; if a VCF contains multiple samples, consensus extraction will be carried out on a per-sample basis; at last, VCFtools is used to calculate statistics for the resulting consensus VCF. Both the Venn Diagram and the statistics consist of two parts: SNV and indels. For SNV, the difference is very small (usually >0.5%). For indels, this is expected since there is no unique way to represent them in some cases. Besides, indels are intervalized (ignoring alleles) and extended by 10bp towards both ends for calculating statistics. Intervalization and extension are not done for obtaining consensus results as the alleles must be reported.

What if /tmp is too small or I got error message No space left on device?

You can specify a larger folder for storing temporary files. There are 2 ways to do this: first, set $TMPDIR in your environment variables; second, use -tmpdir YOUR_TMP_DIRECTORY option in seqmule pipeline or seqmule stats program.

After I finish seqmule analysis, I realized that I used a wrong capture file. How should I address this? Can I just manually change 'finished' to 'waiting' in that step (in the script file)?

Capture file is typically used in multiple stages of analysis (variant calling, stats calculation). Therefore it might not enough to change only one step of the analysis. One way you can try is that run seqmule pipeline -norun with the correct capture file, and then rerun analysis right after alignment seqmule run -n X (X stands for the step number after alignment). If you understand the script, you can modify the bed file name and rerun steps involving that file manually. Note however, right now it is not possible to rerun one particular step (unless that step is the last step) by seqmule run, SeqMule will run the script from a specified step (or resume where it stops) to the end.

How to solve the java.lang.NullPointerException problem?

Some users may see the following error message with GATK or GATKLite:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
    java.lang.ExceptionInInitializerError
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.<init>(GenomeAnalysisEngine.java:160)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.<init>(CommandLineExecutable.java:53)
    at org.broadinstitute.sting.gatk.CommandLineGATK.<init>(CommandLineGATK.java:54)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:90)
    Caused by: java.lang.NullPointerException
    at org.reflections.Reflections.scan(Reflections.java:220)
    at org.reflections.Reflections.scan(Reflections.java:166)
    at org.reflections.Reflections.<init>(Reflections.java:94)
at org.broadinstitute.sting.utils.classloader.PluginManager.<clinit>(PluginManager.java:77)
    ... 4 more

This error is likely to be caused by Java version incompatibility. Please make sure you are using Java 1.7 and put make it your default java program (put the folder containing java at the beginning of your PATH variable).

How are default exome regions defined? Where do they come from?

The default exome defintions (to use them, specify -capture default with seqmule pipeline command) came from UCSC genome browser's RefSeq Gene track (refGene table). Only exons were included plus 5bp at each end of each region. The regions were sorted and further processed to only retain chromosomes 1 to 22, X and Y for hg18 and hg19 genome builds. Overlapping regions were merged. Exonic variants and splicing variants can therefore be included in the results. However, variants in UTR and other regions may not show up. If you want to restrict your analysis for a particular capture kit, please refer to seqmule download -help to download common region definitions or use your own BED file.

How to solve ERROR MESSAGE: Bad input: Error during negative model training. Minimum number of variants to use in training is larger than the whole call set. One can attempt to lower the --minNumBadVariants arugment but this is unsafe. issue?

Here GATK is complaining about too few variants used in model training. This statistical model is used for VQSR (variant quality score recalibration). It is a better method for variant filtering. However, when your capture region is small (e.g. only a few genes), or your average depth is very low, the input might be not sufficient for model training. Hard filtering should be used instead. SeqMule has limited support for automatic hard filtering (when input BAM is smaller than 1GB for SNP, and 15GB for INDEL). The automatic detection is not guaranteed to work, so a safe option is to set forceSNPHardFilter or forceINDELHardFilter to 1 (in advanced_config) to enforce hard filtering. Note, there are a separate pair of such flags for each calling method of GATK, namely GATK UnifiedGenotype caller, GATK HaplotypeCaller and GATKLite UnifiedGenotype caller. They work independently.

How much space is needed for SeqMule databases?

hg19all and hg18all EACH needs approximately 60GB of space after decompressing, which includes the reference genome, index files for BWA, Bowtie, Bowtie2, SNAP and SOAP, databases for GATK. However, because compressed files and decompressed files must co-exist during decompression, one may need another 30 to 40GB of space during download. Note, the space requirement just mentioned is only for database, your input data typically has over 10GB, 20GB or 100GB of sizes depending on your situation, and two times or three times of additional space (relative to raw data) is needed to store intermediate analysis files. If space is limited, one option is to only download hg19 and its index for bwa (seqmule download -d hg19,hg19ibwa), and disable GATK VQSR filtering (set forceSNPHardFilter to 1 in advanced_config) such that auxillary variant databases are not needed. Using the above option, SeqMule only needs about 10GB of space for databases.

If I have to manually download all the databases due to unstable Internet connection, how should I place them in the database/ folder?

The database/ folder should have the following structure (for hg19) after unzipping:

database/
|-- 1000G_omni2.5.b37.vcf
|-- 1000G_omni2.5.b37.vcf.idx
|-- bowtie
|   |-- human_g1k_v37.1.ebwt
|   |-- human_g1k_v37.2.ebwt
|   |-- human_g1k_v37.3.ebwt
|   |-- human_g1k_v37.4.ebwt
|   |-- human_g1k_v37.rev.1.ebwt
|   `-- human_g1k_v37.rev.2.ebwt
|-- bowtie2
|   |-- human_g1k_v37.1.bt2
|   |-- human_g1k_v37.2.bt2
|   |-- human_g1k_v37.3.bt2
|   |-- human_g1k_v37.4.bt2
|   |-- human_g1k_v37.rev.1.bt2
|   `-- human_g1k_v37.rev.2.bt2
|-- bwa
|   |-- human_g1k_v37.fasta -> /absolute_path_to/database/human_g1k_v37.fasta
|   |-- human_g1k_v37.fasta.amb
|   |-- human_g1k_v37.fasta.ann
|   |-- human_g1k_v37.fasta.bwt
|   |-- human_g1k_v37.fasta.fai
|   |-- human_g1k_v37.fasta.pac
|   `-- human_g1k_v37.fasta.sa
|-- dbsnp_hg19_138.vcf
|-- dbsnp_hg19_138.vcf.idx
|-- hapmap_3.3.b37.vcf
|-- hapmap_3.3.b37.vcf.idx
|-- human_g1k_v37.dict
|-- human_g1k_v37.fasta
|-- human_g1k_v37.fasta.fai
|-- Mills_and_1000G_gold_standard.indels.b37.vcf
|-- Mills_and_1000G_gold_standard.indels.b37.vcf.idx
|-- snap
|   |-- human_g1k_v37.fasta
|   |   |-- Genome
|   |   |-- GenomeIndex
|   |   |-- GenomeIndexHash
|   |   `-- OverflowTable
`-- soap
    |-- human_g1k_v37.fasta.index.amb
    |-- human_g1k_v37.fasta.index.ann
    |-- human_g1k_v37.fasta.index.bwt
    |-- human_g1k_v37.fasta.index.fmv
    |-- human_g1k_v37.fasta.index.hot
    |-- human_g1k_v37.fasta.index.lkt
    |-- human_g1k_v37.fasta.index.pac
    |-- human_g1k_v37.fasta.index.rev.bwt
    |-- human_g1k_v37.fasta.index.rev.fmv
    |-- human_g1k_v37.fasta.index.rev.lkt
    |-- human_g1k_v37.fasta.index.rev.pac
    |-- human_g1k_v37.fasta.index.sa
    `-- human_g1k_v37.fasta.index.sai

Why do I see 'Could not find any mapped reads in target region ...'?

It is a warning message from FreeBayes. SeqMule will supply FreeBayes a BED file specifying which region to look at. If no reads are in some of regions, FreeBayes will complain with the above message.

I am confused about file naming, which one should I use for downstream analysis?

The file format for storing variants is VCF. The name of each VCF file implies how it is generated. For example, suppose we have the following files:

patientX.0_bwamem.sort.rmdup.readfiltered.0_varscan.somatic-call.extract.vcf

patientX.0_bwamem.sort.rmdup.readfiltered.0_varscan.somatic-call_union.vcf.idx

patientX.0_bwamem.sort.rmdup.readfiltered.0_varscan.somatic-call_var_stat.txt

patientX.0_bwamem.sort.rmdup.readfiltered.0_varscan.somatic-call.vcf

patientX is your sample name. 0 is a number used internally. bwamem is the aligner. sort and rmdup mean the alignments have been sorted and duplicate reads have been removed after alignment. varscan is the variant caller. The somatic-call means the variants are generated under somatic variant calling mode. extract means the VCF file has been extracted based on the region definition by -capture option. *.idx file is VCF index, you can ignore it. *_var_stat.txt denotes variant statistics file. If you have multiple variant callers and enabled consensus variant calling, then consensus will appear in the file name. Usually, consensus or extract will be the file you want to use.

Which file is used as default argument for --advanced?

misc/advanced_config is the default advanced configuration file, it enables BWA-MEM, GATKLite, SAMtools, and FreeBayes. Please refer to the file for detailed parameters and steps.

Copyright 2014 USC Wang Lab