Long read assembly via CANU using HPC

Last updated on Jan 15, 2020 4 min read bioinformatics, Genomics, HOW2DO

What is all this about?

This bioinformatic notes include step by step work towards assembling long reads using canu. Notice that this how to do notes are based on Henry2 capabilities. Once your sequencing run is complete, the sequencing lab will most likely provide the following files:

m54163.scraps.bam.pbi
m54163.sts.xml
m54163.subreads.bam
m54163.subreads.bam.pbi
m54163.adapters.fasta
m54163.subreadset.xml
m54163.baz2bam_1.log
m54163.transferdone
m54163.scraps.bam

Bold files are some of the most important!

What are subreads?

Subreads represent the region of DNA that was sequenced, depending of how many times the polymerase read that region you may have several subreads for a single genomic region.

Prerequisites and installation in HPC

The following list includes all programs we need to install in the HPC:

bamtools
canu

If you need help installing spack. Check out this tutorial. Remmeber you need to have access to share and usrapps directories. From your HPC instance you can type:

source /usr/local/usrapps/PIunityID/spack/spack_use.csh #To activate spack

spack list bam #searches for bamtools

spack install bamtools #Installs bamtools

spack install canu #Installs canu

The pipeline using CANU

1) Untar sequencing data and convert `subreads.bam` to `subreads.fastq` using `bamtools`

Before unziping sequencing, we need to move the files to the HPC where we will run the pipeline. Use scp to move files into HPC. Prepare a job script to acomplish this step with the following task:

Load bamtools
Unzip files

Extract all reads from bam file.

#!/bin/csh

## Untar sequencing data and convert subreads.bam to subreads.fastq using bamtools

#BSUB -o out.%J
#BSUB -e err.%J 
#BSUB -W 72:00
#BSUB -n 16
#BSUB -R "rusage[mem=16000] span[ptile=8]"
#BSUB -x 
#BSUB -J prep_cfim_canu_assembly

source /usr/local/usrapps/PIunityID/spack/spack_use.csh #To activate spack

spack load bamtools

tar -xvf SEQUENCING_files.tar 
cd SEQUENCING_files

#Extract all reads
bamtools convert -format fastq -in your.subreads.bam -out all_your.subreads.fastq

2) Read correction using `-correct`

Canu’s read correction task will replace the original noisy read sequences with consensus sequences computed from overlapping reads. This task takes a while. It is best to run it by itself. Prepare the following job script:

#!/bin/csh

## Read correction using -correct 

#BSUB -o out.%J
#BSUB -e err.%J 
#BSUB -W 96:00
#BSUB -n 16
#BSUB -R "rusage[mem=16000] span[ptile=8]"
#BSUB -x 
#BSUB -J correction

source /usr/local/usrapps/PIunityID/spack/spack_use.csh #To activate spack

spack load canu #To load canu

canu -correct -p assembly_all -d assembly_all_canu_pacbio genomeSize=40m \
	-pacbio-raw /path/to/all.your.subreads.fastq \
	-maxMemory=16 \
	-maxThreads=8 \
	usegrid=0

3)Read triming using `-trim`

Canu’s read trimming task will use overlapping reads to decide what regions of each read are high-quality sequence, and what regions should be trimmed. After trimming, the single largest high-quality chunk of sequence is retained.

The following job is designed for trimming the all reads data set in the HPC:

#!/bin/csh

## Read triming using -trim

#BSUB -o out.%J
#BSUB -e err.%J 
#BSUB -W 96:00
#BSUB -n 32
#BSUB -R "rusage[mem=32000] span[ptile=16]"
#BSUB -x 
#BSUB -J trim_assembly_all

source /usr/local/usrapps/PIunityID/spack/spack_use.csh #To activate spack

spack load canu

#Read triming for all 
canu -trim -p assembly_all -d assembly_all_canu_pacbio genomeSize=40m \
	-pacbio-corrected /path/to/assembly_all.correctedReads.fasta.gz \
	-maxMemory=32 \
  -maxThreads=16 \
  usegrid=0

4) Assembly with 2.5% error rate

Finally assembly!

The HPC job looks like this:

#!/bin/csh

## Assembly with 2.5% error rate

#BSUB -o out.%J
#BSUB -e err.%J 
#BSUB -W 120:00
#BSUB -n 16
#BSUB -q shared_memory
#BSUB -R "rusage[mem=32000] span[hosts=1]"
#BSUB -x 
#BSUB -J assembly_all_0.025

source /usr/local/usrapps/PIunityID/spack/spack_use.csh #To activate spack

spack load canu

set numCores=`lscpu | grep "^CPU(s)" | awk '{print $2}’`
echo $numCores

#Assembly with 2.5% error rate 
#For all
canu -assemble -p assembly_all -d assembly_all_erate-0.025 \
        genomeSize=40m \
        correctedErrorRate=0.025 \
        -pacbio-corrected /path/to/assembly_all.trimmedReads.fasta.gz \
        -maxMemory=32 \
        -maxThreads=$numCores \
        usegrid=0

Some useful monitoring commands:

#Checking if it run:

bjobs -l
tail -f
bhist
 
#To modify time, ##### = jobnumber
bjobs -l #####
bmod -W 96:00 #####

This first assembly process produced the following results:

All_0.025

Total units: 23
Reference: 40000000
BasesInFasta: 31899376
Min: 2043
Max: 5440577
N10: 5440577 COUNT: 1
N25: 4585891 COUNT: 2
N50: 3215283 COUNT: 5
N75: 1194864 COUNT: 10
E-size:2750431.27

how2do genomics

Camilo H. Parada-Rojas

Postdoctoral Scholar

Create by Camilo H. Parada-Rojas