Incomplete filtration of reads by length in .fastq and in .bam files
0
0
Entering edit mode
4.5 years ago

Hi,

I want to filter my reads to have only 29-31 nt long ones. I tried these three methods:

## 1:
awk 'BEGIN {FS = "\t" ; OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 29 && length(seq) <= 31) {print header, seq, qheader, qseq}}' < in.fq > out_filtered.fq

## 2:
reformat.sh in=in.sam out=out_filtered.sam minlength=29 maxlength=31

## 3:
seqkit seq in.fq -j 10 -m 29 -M 31 > out_filtered.fq

At the end, none of them had reads longer than 31 nt but all of them had some shorter than 29 nt. Around 5% of the filtered reads are 28 nt long and there are less than 1% of them with less than 28 nt length.

Why is this happening and how can I fix this?

Or is it possible that the scripts 'psite' and 'phase_by_size' from the plastid package is doing something wrong to give this result?

Here is the link for two pics of results of 'psite' and 'phase_by_size': https://photos.app.goo.gl/sGhT2dvoNsAQHF7d7

Thanks.

Farkas

RNA-Seq • 791 views
ADD COMMENT
2
Entering edit mode

To me, the awk seems to do the job (I didn't check the other methods).

For control, you could do {print length(seq), seq}.

And check manually the corner cases (e.g. len == 29).

Do you really convert the out_filetered.fq to *.bam and run phase_by_size on that?

ADD REPLY
1
Entering edit mode

Agreed, you should check manually if this tool is not reporting something odd.

I did:

$ cat test.fq
@longer
AGCTCCGATCGATCGATCGATCGAGTAGCTAGCTACGATCGATCGATCGCACTAGC
+
AGCTCCGATCGATCGATCGATCGAGTAGCTAGCTACGATCGATCGATCGCACTAGC
@shorter
AATTGC
+
AAAAAA
@30
AAAAACCCCCGGGGGTTTTTGGGGGAAAAA
+
AAAAACCCCCGGGGGTTTTTGGGGGAAAAA
@28
AGCTAGCTAGCTAGCTAGCTAGCTAGCT
+
AGCTAGCTAGCTAGCTAGCTAGCTAGCT

$awk 'BEGIN {FS = "\t" ; OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 29 && length(seq) <= 31) {print header, seq, qheader, qseq}}' < test.fq 
@30
AAAAACCCCCGGGGGTTTTTGGGGGAAAAA
+
AAAAACCCCCGGGGGTTTTTGGGGGAAAAA
ADD REPLY
0
Entering edit mode

Yes, checking manually it gives only reads 29-31 nt.

out_filtered.fq files were mapped and the psite and phase_by_size were run on the generated .bam files.

So now the question is why these scripts show this weird results... Maybe I should just skip this plastid stuff and use something else instead.

Thanks!

ADD REPLY
1
Entering edit mode

Can you try following two step procedure with BBtools?

reformat.sh in=in.sam out=stdout.sam maxlength=31 | reformat.sh in=stdin.sam out=out_filtered.sam minlength=29
ADD REPLY
0
Entering edit mode

I tried. Gives the same results.

ADD REPLY
1
Entering edit mode

Following works for me using @ATPoint's test file.

$ reformat.sh in=../test.fq out=stdout.fq minlength=29 maxlength=31

Input is being processed as unpaired
@30
AAAAACCCCCGGGGGTTTTTGGGGGAAAAA
+
AAAAACCCCCGGGGGJJJJJGGGGGAAAAA
Input:                          4 reads                 120 bases
Short Read Discards:            3 reads (75.00%)        90 bases (75.00%)
Output:                         1 reads (25.00%)        30 bases (25.00%)

Time:                           0.072 seconds.
Reads Processed:           4    0.06k reads/sec
Bases Processed:         120    0.00m bases/sec

OR

$ reformat.sh in=../test.fq out=stdout.fq maxlength=31 | reformat.sh in=stdin.fq out=stdout.fq minlen=29 int=f

Set INTERLEAVED to false
Input is being processed as unpaired
Input is being processed as unpaired
Input:                          4 reads                 120 bases
Short Read Discards:            1 reads (25.00%)        56 bases (46.67%)
Output:                         3 reads (75.00%)        64 bases (53.33%)

Time:                           0.069 seconds.
Reads Processed:           4    0.06k reads/sec
Bases Processed:         120    0.00m bases/sec
@30
AAAAACCCCCGGGGGTTTTTGGGGGAAAAA
+
AAAAACCCCCGGGGGJJJJJGGGGGAAAAA
Input:                          3 reads                 64 bases
Short Read Discards:            2 reads (66.67%)        34 bases (53.13%)
Output:                         1 reads (33.33%)        30 bases (46.88%)

Time:                           0.084 seconds.
Reads Processed:           3    0.04k reads/sec
Bases Processed:          64    0.00m bases/sec
ADD REPLY

Login before adding your answer.

Traffic: 1961 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6