Faster alternative to Prinseq PE trimming polyN
5.7 years ago
umn_bist ▴ 390

I am using BBDuk for quality/adapter trimming and for discarding reads shorter than 40 bp in my paired-end (PE) RNA-seq tumor/normal samples.

I am also removing polyN reads (reads in which a single base, A, T, G or C, makes up at least 75% of the sequence) for fear that they will map uniquely.

My issue with PRINSEQ is that I have to sort my PE files first, which takes a few hours for a single file.

#paste - - - - < "${file3}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file5}"
#paste - - - - < "${file4}" | sort -k1,1 -t " " | tr "\t" "\n" > "${file6}"

perl ${PRINSEQ} -fastq "${file5}" -fastq2 "${file6}" -no_qual_header -trim_right 1 \
  -custom_params "A 75%;T 75%;G 75%;C 75%" -min_len 40 -out_format 3 \
  -out_good "${file1%_1.fastq}_tst" -out_bad null -log
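For readers unfamiliar with the -custom_params rule, the composition test described above (one base making up at least 75% of a read) can be sketched in awk. This only illustrates the rule as stated in the question. It is not PRINSEQ's implementation. The helper name flag_low_complexity, the default threshold argument, and the use of a non-strict comparison (>=) are all assumptions.

```shell
# flag_low_complexity FASTQ [MAXFRAC]
# Hypothetical helper: prints the ID of every read in which one base
# (A, C, G or T) accounts for MAXFRAC (default 0.75) or more of the sequence.
# Assumes plain 4-line FASTQ records.
flag_low_complexity() {
    awk -v maxfrac="${2:-0.75}" '
        NR % 4 == 1 { id = $1 }      # header line: remember the read ID
        NR % 4 == 2 {                # sequence line
            n = length($0)
            if (n == 0) next
            split("A C G T", bases, " ")
            for (i = 1; i <= 4; i++) {
                seq = toupper($0)
                # gsub returns how many times the base occurred.
                if (gsub(bases[i], "", seq) / n >= maxfrac) { print id; break }
            }
        }
    ' "$1"
}
```

Running flag_low_complexity on each mate file lists the reads the rule would catch, which gives a quick idea of how many pairs are affected before committing to a full filtering run.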



Are there any (faster) alternatives for this purpose?

RNA-Seq prinseq BBDuk • 1.3k views

Why do you sort the FASTQ files?


From the PRINSEQ FAQ:

PRINSEQ requires sorted input files for paired-end or mate-pair data processing. If your two FASTQ files of a paired-end (or mate-pair) dataset need to be sorted by their sequence identifiers, you can use the following one-liner in Linux/Unix/OSX:

paste - - - - < file_1.fastq | sort -k1,1 -t " " | tr "\t" "\n" > file_1_sorted.fastq
paste - - - - < file_2.fastq | sort -k1,1 -t " " | tr "\t" "\n" > file_2_sorted.fastq


This will first join the 4 lines (paste - - - -) of a FASTQ entry into a single line (with each of the 4 original lines separated by tabs), then sort them by their sequence identifier (-k1,1 -t " " specifies everything before the first space for the sorting, which is our sequence identifier), and write each entry again in 4 lines by replacing the tabs with line breaks. The sorted entries are then saved in a new file specified after the ">" sign.

The files I am working with are from TCGA.


I see. To speed up the sort, you could force the C locale, which compares raw bytes instead of applying locale-aware collation: | LC_ALL=C sort -k1,1 -t ' ' |
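Dropped into the FAQ one-liner, the suggestion could look like the sketch below. The function name sort_fastq is a hypothetical wrapper, not part of PRINSEQ, and the sketch assumes plain 4-line FASTQ records without tabs in the header.

```shell
# sort_fastq IN OUT
# Hypothetical helper: sorts a 4-line-per-record FASTQ file by read ID.
sort_fastq() {
    # Join each 4-line record into one tab-separated line, sort on the text
    # before the first space using byte order (C locale), then split each
    # record back into 4 lines.
    paste - - - - < "$1" | LC_ALL=C sort -k1,1 -t ' ' | tr '\t' '\n' > "$2"
}

# Usage with the variables from the question:
# sort_fastq "${file3}" "${file5}"
# sort_fastq "${file4}" "${file6}"
```

Both mate files need to be sorted under the same locale, otherwise their record order may no longer match. With GNU sort, the --parallel and -S (buffer size) options may help further, but those are GNU extensions and their benefit depends on the machine.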

5.7 years ago

From your URL, I don't think you need to sort your FASTQ files if you know they are already paired.


This will be a huge, HUGE time saver. For future reference, if the files are already paired, can I assume that they are sorted?


if the files are already paired, can I assume that they are sorted

Yes. Just check:

paste <(paste - - - - < file_1.fastq | cut -f 1) <(paste - - - - < file_2.fastq  | cut -f 1 )


The two columns should contain the same IDs, in the same order (modulo the /1 and /2 suffixes).
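For files with millions of reads, comparing the two columns by eye is impractical, so the check can be automated. The sketch below is one way to do it. check_pairs is a hypothetical helper name, and the sketch assumes 4-line FASTQ records where the read ID is the header text before the first whitespace, optionally ending in /1 or /2.

```shell
# check_pairs R1 R2
# Hypothetical helper: exits 0 if both files hold the same read IDs in the
# same order, otherwise prints the first problem found and exits 1.
check_pairs() {
    awk -v mates="$2" '
        NR % 4 == 1 {
            rec = (NR + 3) / 4
            # Read the matching 4-line record from the mate file.
            if ((getline h2 < mates) <= 0) {
                print "mate file ends early at record " rec; bad = 1; exit
            }
            getline skip < mates; getline skip < mates; getline skip < mates
            # Keep the text before the first whitespace and drop /1 or /2.
            id1 = $1; sub(/\/[12]$/, "", id1)
            split(h2, f, /[ \t]/); id2 = f[1]; sub(/\/[12]$/, "", id2)
            if (id1 != id2) {
                print "mismatch at record " rec ": " id1 " vs " id2; bad = 1; exit
            }
        }
        END {
            if (!bad && (getline h2 < mates) > 0) {
                print "mate file has extra records"; bad = 1
            }
            exit bad + 0
        }
    ' "$1"
}

# Usage:
# check_pairs file_1.fastq file_2.fastq && echo "files are in matching order"
```

The function stops at the first discrepancy, so a mismatched pair of files is reported without reading either file to the end.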


I accepted the answer. Once again, thank you, thank you for the help. The command will be very useful!