Extracting reads with homopolymer errors
1
0
Entering edit mode
7.5 years ago

From a the mapped reads of a bam file, how can extract a subset of reads that contain k-mers (for e.g., reads that contain TTT, TTTTTT, TTTTTTTT, TTTTTTTTTTTTT) ?

homopolymer errors • 1.8k views
ADD COMMENT
0
Entering edit mode

Would make more sense to extract those from the fastq, no?

ADD REPLY
0
Entering edit mode
7.5 years ago
venu 7.1k

If you want only reads where a nt is repeated a particular number of times, you can do the following

samtools view file.bam | grep 'T\{4,\}' | sed 's/^/>/' | awk -F'\t' '{print $1"\n"$10}' > foo.fa

Here, nt T is consecutively found a minimum of 4 times. You can change the number as per your requirement.

As @genomax, suggested, updated to create a fasta file with reads. If it is mandatory to convert to fastq, I would suggest OP to do some work like finding this.

ADD COMMENT
0
Entering edit mode

You may want to extend this to generate reads in fastq format which is what OP likely wants.

ADD REPLY

Login before adding your answer.

Traffic: 2036 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6