MetaCarvel Pipeline¶
MetaCarvel is a scaffolding program designed specifically to address the issues concerning the metagenomic data.
The scaffolding algorithm consists of several steps. MetaCarvel requires contig assembly fasta file, mapping of reads to contigs in BAM file and output directory as an input. It first analyzes the read to contig mapping and estimates the library insert size. Using these mappings, it generates a scaffold graph where nodes are contigs and edges are mate pairs mapped to two different contigs. This is done in libcorrect
program. Multiple links between two contigs are collapsed into a single link using bundler
program. After this, repeats in the graph are removed. High betweenness centrality nodes are identified using centrality
program and removed from the graph using repeat_filter.py
program. Contigs remaining after filtering repetitive contigs are assigned a unique orientation using orientcontigs
program. The bubbles caused due to polymorphic regions in the metagenome are identified using spqr
program. Once these bubbles are identified, they are collapsed and contig layout is performed in layout.py
program.
Installing the software¶
MetaCarvel is open source and available on github. You need to run following steps to get it installed
git@github.com:marbl/MetaCarvel.git
cd MetaCarvel
make
This would generate the executables required to run MetaCarvel.
Preparing the data¶
In order to run MetaCarvel, you will need two files. First one is the contig assembly of the raw reads using your favorite metagenome assembly program. The second one is the alignment of raw paired end reads to assembled contigs using your favorite alignment program. We strongly recommend to map paired end reads as single end reads since aligner enforces the reads to map in the distances given by the library size. If you are using Bowtie2 for read mapping, we recommend aligning reads this way:
bowtie2 -x idx -U forward_reads.fq | samtools view -bS - > alignment_1.bam
bowtie2 -x idx -U reverse_reads.fq | samtools view -bS - > alignment_2.bam
samtools alignment_total.bam alignment_1.bam alignment_2.bam
samtools sort -n alignment_total.bam -o alignment_total_sorted.bam
We need this file sorted by the read name and not the coordinates. You will also need to provide the output directory where you want your scaffolds to be written.
Running MetaCarvel¶
Before you run MetaCarvel, please make sure you have following dependencies: samtools, bedtools and NetworkX. After you have these dependencies, running MetaCarvel is pretty straightforward. You would need to execute run_new.py
file.
python run_new.py -h
usage: run_new.py [-h] -a ASSEMBLY -m MAPPING -d DIR [-f FORCE] [-r REPEATS]
[-k KEEP] [-l LENGTH] [-b BSIZE]
MetaCarvel: A scaffolding tool for metagenomic assemblies
optional arguments:
-h, --help show this help message and exit
-a ASSEMBLY, --assembly ASSEMBLY
assembled contigs
-m MAPPING, --mapping MAPPING
mapping of read to contigs in bam format
-d DIR, --dir DIR output directory for results
-f FORCE, --force FORCE
force re-run of pipeline, will remove any existing
output
-r REPEATS, --repeats REPEATS
To turn repeat detection on
-k KEEP, --keep KEEP Set this to kepp temporary files in output directory
-l LENGTH, --length LENGTH
Minimum length of contigs to consider for scaffolding
-b BSIZE, --bsize BSIZE
Minium mate pair support between contigs to consider
for scaffolding
The mandatory inputs for this are the FASTA file for contig assembly, the alignment file and output directory. There are other options as well.
-r
option lets you toggle the repeat detection process in the scaffolding pipeline. By default it is turned off, but you can provide -r true
parameter to include repeat detection. :code: -l option lets you set the minimum length of the contigs to consider for scaffolding. By default this is set to :code: 500. -b
option lets you set the minimum mate pair support between two contigs to consider for scaffolding. By default this is set to 3
. Each step in the pipeline generates temporary files as intermediate outputs. If you want to retain these files, you need to set -k
option to true
. By default, these files will be removed.
Interpreting the output¶
Since MetaCarvel consists of multiple steps, it generates some intermediate files. These files can be useful if you want to dig more into the analysis.
The libcorrect
generates two output files: contig_links
, having a raw links between the contigs based on mate pairs and contig_coverage
, having a coverage information for each contig. The bundler
program uses contig_links
file, bundles the links and outputs them in the bundled_links
file in a tsv format and in the bundled_graph.gml
in the gml format. If repeat detection is chosen then that would generate a file with name bundled_links_filtered
which will have the bundled links corresponding to all the non-repetitive contigs. The orientcontigs
program takes the bundled links as an input and outputs a graph with only one orientation corresponding to each contig. This graph is written in two formats: oriented_links
in tsv format and oriented.gml
in a gml format. The spqr
program takes orientated graph as an input and produces a potential bubbles as an output in a file called seppairs
. The layout.py
produces three main files. One is scaffolds.fa
, represnting the scaffold sequences, scaffolds.agp
, representing scaffolds in an AGP format and scaffolds.gfa
representing scaffold graph in GFA format. If you want to visualize the graphs used to generate scaffolds, you can either use either oriented.gml or scaffold_graph.gfa file and load it into any standard graph visualization software. We recommend using MetagenomeScope as you can visualize variants and scaffold paths simply by loading the agp output into the viewer.