Removing N's from NextSeq reads
6.2 years ago
exin ▴ 60

Hi,

I've received my sequences from Illumina NextSeq in fastq files (R1: 15 bp, R2: 77 bp). I'm trimming the end of R2 where the Q-score is low, but I also suspect the low mapping percentage is partly due to all the N's in my sequences.
I'd like to get rid of reads with excessive N's before mapping.

Sample fastq:

@NS500799:417:HHMWHBGX7:1:11101:10296:1148:UMI:GAGACT:
GCCCTGCCTTCAAATGAAANNAGTTCAGCATGCCC
+
AAAAAEEEEEE66<EE/66EE<EEA6/A//EAAEE
@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/
@NS500799:417:HHMWHBGX7:1:11101:22941:1238:UMI:CCGTTT:
GCGCAGAGTTTAAACGCGAATNNGCTCAACTGGTT
+
AA/AA//EEEEEEEEEEEEEEEEEEEAE/EEEEEE

Desired output:

@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/

Q1: I can't get the right command to do this. Something like

grep -v "$(grep -A 2 -B 1 "NN" in.fastq)" in.fastq

sort of describes what I want to do, but it undesirably gets rid of the +'s for me. Any ideas?

Q2: As R1 & R2 are paired reads delivered as separate fastq files, if I remove the sequences with N's in R2, how does this affect the demultiplexing and bowtie mapping steps? Should I be removing corresponding sequences in R1 too? Unfortunately I don't understand the Python script (CEL-Seq2 by the Yanai Lab) well enough to determine this.

Thank you!!

RNA-Seq sequence cel-seq unix
6.2 years ago
Tm ★ 1.1k

You can download prinseq and use the prinseq-lite.pl script included in it to remove sequences containing more than a desired number of N's, using the following parameters:

prinseq-lite.pl -fastq test_R1.fastq -fastq2 test_R2.fastq -ns_max_n (number of 'N')
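
For example, a minimal sketch that keeps only pairs with no N's at all (the file names are placeholders, and the -out_good/-out_bad options are optional extras from prinseq-lite's standard flags; check prinseq-lite.pl -h for your version):

prinseq-lite.pl -fastq test_R1.fastq -fastq2 test_R2.fastq -ns_max_n 0 -out_good filtered -out_bad null

With both files supplied, prinseq filters the mates together, so the surviving R1 and R2 stay matched; raise -ns_max_n if you only want to drop reads with many N's.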

I've received my sequences from Illumina NextSeq in fastq files (R1: 15 bp, R2: 77 bp),

But just wondering, how come your R1 reads are only 15 bp? Which application are you working on?


A shorter read 1 is normal in scRNAseq, since it's mostly UMI and cell barcodes followed by a stretch of AAAAAAAAAA...


Ok, I have never worked on scRNAseq, so I didn't know this.


Thank you!! Will look into prinseq.

Also, I'm still wondering if it's possible to achieve this with just bash commands? It seems doable; I'm just not proficient at it.

Yes, I did CEL-Seq. R1 is for UMIs and barcodes only.

6.2 years ago
GenoMax 146k

This is not how you should be removing the N's. By deleting cycles in the middle of the read you are going to screw up the reading frame. If you want to remove N's, you should trim the read from that point all the way to the 3'-end.

If all of your reads have N's at that specific position then it is possible that the sequencer experienced a hardware/software issue during the run. Illumina generally replaces reagents for runs that fall in this category (as long as you have a maintenance contract), so your sequencing provider should re-run this sample at no charge.

Edit: I see that this is a scRNA run. So I would be inclined to think that this run was likely overloaded and did not have enough phiX spike-in.


I'm not sure if I'm missing something. Can you explain how what I wanted to do would screw up the reading frame? I'm not removing just the N's; I'm discarding entire reads before the mapping step if they contain consecutive N's. Those N's are still there after trimming.

Thanks for pointing out the likely cause! Speaking to the sequencing facility to see if this was the issue...


My apologies for not looking at your example carefully. I thought you were removing the N's and leaving the rest of the sequence intact, since you posted a sequence titled "desired output". I edited my comment above.

6.2 years ago

Should I be removing corresponding sequences in R1 too?

yes

Any ideas?

Linearize both fastq files (R1 and R2), filter with awk, then convert back to interleaved fastq:

paste <(gunzip -c S1.R1.fq.gz | paste - - - -)  <(gunzip -c S1.R2.fq.gz | paste - - - -) | awk '!($2 ~ /N/ || $6 ~ /N/ )'  | tr "\t" "\n"

and if your software cannot handle interleaved fastq, see deinterleave fastq file
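
If it helps, one common way to deinterleave (a sketch: it assumes you saved the interleaved output of the command above as interleaved.fq, and the output names are placeholders):

paste - - - - - - - - < interleaved.fq > pairs.tsv
cut -f 1-4 pairs.tsv | tr "\t" "\n" > S1.clean.R1.fq
cut -f 5-8 pairs.tsv | tr "\t" "\n" > S1.clean.R2.fq

paste groups the 8 lines of each kept pair into one row, cut splits the R1 and R2 halves, and tr restores the 4-line fastq records.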


Thanks Pierre! I'm a bioinformatics newbie; could I trouble you to expand your answer a bit more? How do I linearise R1 and R2 with awk? And how do the cleaned R2 reads come into these steps?


You don't linearize with awk; you linearize the 4 lines of each fastq record with paste.

$ seq 1 8
1
2
3
4
5
6
7
8

$ seq 1 8 | paste - - - -
1   2   3   4
5   6   7   8

You do this with both R1 and R2.

awk is then used to drop any row where column 2 or column 6 (the R1 and R2 sequence columns) contains an 'N'.


Great. Thanks for the elaboration!!

6.2 years ago
fastp -i test.fq -n 1 -o test.n.fq 

$ cat test.n.fq 
@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/

fastp is available in the brew and conda repos. -n is the maximum number of N's tolerated, in this case 1: any read with more than one N will be removed.

Syntax for paired-end data (copied from the fastp GitHub page); try adapting the command above for paired end:

fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz
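
For this thread's case that would be (a sketch with placeholder file names, reusing the -n filter from above):

fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz -n 1

In paired-end mode fastp discards both mates when either read fails the N filter, so R1 and R2 stay in sync.
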
6.2 years ago

Alternate solution using seqkit

seqkit grep -srv -p 'N' sample.fastq
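
That filters a single file. For paired data, one option (a sketch with placeholder file names; recent seqkit releases include a pair subcommand, check seqkit pair --help for the output naming) is to clean R2 and then re-match R1 against the survivors:

seqkit grep -srv -p N sample_R2.fastq -o sample_R2.clean.fastq
seqkit pair -1 sample_R1.fastq -2 sample_R2.clean.fastq
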
6.2 years ago

Q1 using only grep:

$ grep --no-group-separator  -B 1 "NN" test.fq | grep '\@' | grep  -v -f - <(grep '\@' test.fq) | grep -A 3 -f - test.fq
@NS500799:417:HHMWHBGX7:1:11101:16095:1210:UMI:TGCACC:
CGCGCCGCAAGACTGGTACACAATGACTGAAATGA
+
AAAAAEEEEEEEEEEEEEA/EEEEEAEEEEEEEE/
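
If the nested greps feel hard to follow, the same bash-only filter can be written as one awk pass (a sketch reusing the 'NN' pattern and test.fq from above): it buffers each 4-line record and prints it only when the sequence line has no consecutive N's.

awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0} NR%4==0{if (s !~ /NN/) print h"\n"s"\n"p"\n"$0}' test.fq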

This works! Thank you!
