preprocessing
abutils provides several functions for preprocessing sequence data, including merging paired-end
FASTQ files. The preprocessing module handles two file naming schemas (Illumina and Element)
and supports two merging algorithms (fastp and vsearch).
The primary function is abutils.tl.merge_fastqs(), which handles the process of organizing,
grouping, and merging paired-end FASTQ files.
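To make the grouping step concrete, here is a minimal sketch of how Illumina-style file names can be paired by sample and lane. It is an illustration only, not abutils's actual implementation: the regular expression and the helper `group_illumina_pairs` are hypothetical, and the pattern assumes the standard Illumina naming convention (`<sample>_S<n>_L<lane>_R<read>_001.fastq.gz`). The Element schema is not covered here.

```python
import re
from collections import defaultdict

# Illumina-style names: <sample>_S<n>_L<lane>_R<read>_001.fastq(.gz)
# (hypothetical pattern for illustration; not taken from abutils)
ILLUMINA_PATTERN = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_(?P<read>R[12])_001\.fastq(\.gz)?$"
)


def group_illumina_pairs(filenames):
    """Group file names by (sample, lane) into dicts keyed by 'R1'/'R2'."""
    groups = defaultdict(dict)
    for name in filenames:
        match = ILLUMINA_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the naming pattern
        key = (match["sample"], match["lane"])
        groups[key][match["read"]] = name
    return dict(groups)


names = [
    "donor1_S1_L001_R1_001.fastq.gz",
    "donor1_S1_L001_R2_001.fastq.gz",
    "donor2_S2_L001_R1_001.fastq.gz",
    "donor2_S2_L001_R2_001.fastq.gz",
    "notes.txt",
]
pairs = group_illumina_pairs(names)
for (sample, lane), reads in sorted(pairs.items()):
    print(sample, lane, reads["R1"], reads["R2"])
```

Each value in `pairs` holds the forward and reverse file for one sample and lane, which is the unit a merging tool needs as input.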
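| preprocessing method | function |
|---|---|
| Merge paired-end reads | abutils.tl.merge_fastqs() |
| Merge with fastp | abutils.tl.merge_fastqs_fastp() |
| Merge with vsearch | abutils.tl.merge_fastqs_vsearch() |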
examples
merge paired-end FASTQ files using fastp
By default, merge_fastqs() merges paired-end reads with fastp, which also performs
quality filtering and adapter trimming.
import abutils

# merge paired-end FASTQ files from a directory
merged_files = abutils.tl.merge_fastqs(
    files='path/to/fastq/directory',
    output_directory='path/to/output',
    schema='illumina',     # file naming schema: 'illumina' or 'element'
    compress_output=True,  # output gzipped FASTQ files
    verbose=True
)
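As a quick sanity check after merging, you may want to count the records in each output file. The sketch below assumes that `merged_files` is a list of paths to the merged FASTQ files; the variable name suggests this, but the return type is not stated here. The helper `count_fastq_records` is hypothetical and not part of abutils. A small demo file stands in for real output so the example runs on its own.

```python
import gzip
import os
import tempfile


def count_fastq_records(path):
    """Count records in a FASTQ file (4 lines per record), gzipped or plain."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# stand-in for a merged output file, so the sketch runs without abutils
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "sample.fastq.gz")
with gzip.open(demo_path, "wt") as handle:
    handle.write("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n")

# ASSUMPTION: merge_fastqs() returns a list of output file paths
merged_files = [demo_path]
counts = {os.path.basename(f): count_fastq_records(f) for f in merged_files}
print(counts)
```

Comparing these counts with the number of input read pairs gives a rough merge rate per sample.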
customize quality trimming and adapter removal
merge_fastqs() exposes parameters for adapter trimming, sliding-window quality trimming, and the minimum overlap required between reads.
import abutils

# merge with customized quality trimming and adapter removal
merged_files = abutils.tl.merge_fastqs(
    files='path/to/fastq/directory',
    output_directory='path/to/output',
    trim_adapters=True,
    adapter_file='path/to/adapters.fasta',  # custom adapter sequences
    quality_trim=True,
    window_size=5,                # sliding window size
    quality_cutoff=15,            # quality threshold
    minimum_overlap=20,           # minimum overlap between reads
    log_directory='path/to/logs'  # save fastp reports
)
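To illustrate what `window_size` and `quality_cutoff` control, here is a simplified sketch of sliding-window quality trimming: a window moves from the start of the read toward the end, and the read is cut at the first window whose mean quality falls below the cutoff. This is a conceptual illustration, not fastp's implementation, and which fastp trimming mode abutils enables is not stated here. The function `sliding_window_trim` is hypothetical.

```python
def sliding_window_trim(sequence, qualities, window_size=5, quality_cutoff=15):
    """Cut the read at the first window whose mean quality is below the cutoff.

    `qualities` is a list of integer Phred scores, one per base.
    Reads shorter than the window are returned unchanged.
    """
    for start in range(0, len(sequence) - window_size + 1):
        window = qualities[start:start + window_size]
        if sum(window) / window_size < quality_cutoff:
            # drop the failing window and everything to its right
            return sequence[:start], qualities[:start]
    return sequence, qualities


seq = "ACGTACGTACGT"
quals = [30, 30, 30, 30, 30, 30, 30, 10, 10, 10, 10, 10]
trimmed_seq, trimmed_quals = sliding_window_trim(seq, quals)
print(trimmed_seq)  # → ACGTAC
```

In this example the window starting at position 6 has a mean quality of 14, which is below the cutoff of 15, so the read is cut there. A larger window smooths over isolated low-quality bases, while a higher cutoff trims more aggressively.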
direct low-level merging with specific algorithms
For more control, you can call the algorithm-specific merge functions directly. These take explicit paths for each input file and for the output file, rather than input and output directories.
import abutils

# merge a specific pair of files with fastp
abutils.tl.merge_fastqs_fastp(
    forward='path/to/sample_R1.fastq.gz',
    reverse='path/to/sample_R2.fastq.gz',
    merged='path/to/output/sample.fastq.gz',
    minimum_overlap=25,
    allowed_mismatches=3,
    trim_adapters=True,
    quality_trim=True
)

# or with vsearch
abutils.tl.merge_fastqs_vsearch(
    forward='path/to/sample_R1.fastq.gz',
    reverse='path/to/sample_R2.fastq.gz',
    merged_file='path/to/output/sample.fastq',
    output_format='fastq',
    minimum_overlap=25,
    allowed_mismatches=3
)
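The roles of `minimum_overlap` and `allowed_mismatches` can be sketched with a naive overlap merger: the reverse read is reverse-complemented, then the longest overlap with the end of the forward read that has few enough mismatches is accepted. This is a toy illustration under simplifying assumptions (no quality scores, forward-read bases win in the overlap). It is neither fastp's nor vsearch's algorithm, and `reverse_complement` and `merge_pair` are hypothetical helpers.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def merge_pair(forward, reverse, minimum_overlap=25, allowed_mismatches=3):
    """Merge a read pair at the longest acceptable overlap, or return None."""
    rev = reverse_complement(reverse)
    max_overlap = min(len(forward), len(rev))
    for overlap in range(max_overlap, minimum_overlap - 1, -1):
        tail = forward[len(forward) - overlap:]
        head = rev[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= allowed_mismatches:
            # simplification: keep the forward read's bases in the overlap
            return forward + rev[overlap:]
    return None


# a 20-base fragment sequenced from both ends with 14-base reads (8-base overlap)
fragment = "ACGTTGCAAGGCTTACCGAT"
forward = fragment[:14]
reverse = reverse_complement(fragment[6:])

merged = merge_pair(forward, reverse, minimum_overlap=5, allowed_mismatches=0)
too_strict = merge_pair(forward, reverse, minimum_overlap=10, allowed_mismatches=0)
print(merged)      # → ACGTTGCAAGGCTTACCGAT
print(too_strict)  # → None
```

Raising `minimum_overlap` above the true overlap (8 bases here) makes the pair unmergeable, which is why a value that is too high for your amplicon length can silently discard reads.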