Question: Getting Unmapped Reads: Comparing Fastq To Bam
7
gravatar for User 9996
7.4 years ago by
User 9996800
User 9996800 wrote:

given a FASTQ file and a BAM file of aligned reads, is there an efficient way to get all FASTQ reads that are in the original FASTQ but not in the BAM? Perhaps using bedtools. i.e.:

unmapped_script original.fastq aligned.bam > unmapped.fastq

should create an unmapped.fastq file, which is a subset of original.fastq containing only those entries that do not appear in aligned.bam

thank you.

bedtools fastq bam samtools • 13k views
ADD COMMENTlink written 7.4 years ago by User 9996800
2

Every BAM file I've seen contains all reads, including unmapped ones. Are you sure that unmapped reads aren't present in your BAM file? Or am I misundersatnding

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k
1

I've mainly used BWA, so that's been my experience. I just checked bowtie, and if you run it as /path/to/bowtie --sam <genome> <input_file>, all reads are still reported, including unmapped reads (FLAG is 4, RNAME is *, POS and MAPQ are 0)

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k

bwa definately included unmapped reads, I'm not sure bowtie does by default.

ADD REPLYlink written 7.4 years ago by Swbarnes21.4k

These BAMs are produced by Tophat. My understanding is that Tophat only reports aligned reads from FASTQ, isn't that right?

ADD REPLYlink written 7.4 years ago by User 9996800

I've never used TopHat, but a quick peek online makes it look like they only report mapped reads. First time I've seen that...

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k

TopHat only puts in the mapped reads - it'll put them in multiple times if they map to many locations as well - so you may need to pipe column 1 into sort/uniq UNIX pipes to get unique mapped read names.

ADD REPLYlink written 7.4 years ago by Chris Penkett480
8
gravatar for brentp
7.4 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

If you sort the mapped BAM names and the read-names, you can use linux join to print out unpaired--e.g. unmapped reads.

FQ=$1
BAM=$2

# print out mapped headers
samtools view -F 4 $BAM | awk '{ print "@"$1"/1" }' | sort -u > mapped.txt
# grap only the fq headers.
awk '(NR % 4 == 1)' $FQ |  sort -u > reads.txt
# join and print only un-paired (-v)
join -v 1 reads.txt mapped.txt > unmapped.txt

call like::

bash unmapped.sh input.fq mapped.bam

and the unmapped reads will be in unmapped.txt, from there, you can use grep -f to get back a fastq file. Note, you may have to change the awk command (currently '{ print "@"$1"/1" }') to either not add the trailing "/1" or to add "/2" depending on the read end (or depending on the samtools flags.

ADD COMMENTlink modified 7.4 years ago • written 7.4 years ago by brentp23k

what is reads/WT... etc? in here? SHould that be changed to $FQ?

ADD REPLYlink written 7.4 years ago by User 9996800

yes. thanks....

ADD REPLYlink written 7.4 years ago by brentp23k

won't sort -u run into the same issues I point below? where sorting with unix is different from samtools sort -n? I can't get them to sort the same way it's very frustrating

ADD REPLYlink written 7.4 years ago by User 9996800

No, in this example brentp is using unix's sort for both files.

ADD REPLYlink written 7.0 years ago by Random160
6
gravatar for Mitch Bekritsky
7.4 years ago by
Mitch Bekritsky1.1k
London, England
Mitch Bekritsky1.1k wrote:

Get the query names from the BAM file:

samtools view <input file> | cut -f1 | sort > BAM_headers.txt

Get the query names from your FASTQ file (assuming your read length is 101bp):

awk '$0 ~ /^@/ && length($0) < 101' <fastq_file> | sed 's/^@//' | sort > FASTQ_headers.txt

Use diff to compare the two files.

diff BAM_headers.txt FASTQ_headers.txt

Since all reads in the FASTQ headers file will also be in the BAM headers file, diff should only show query names that are in the FASTQ file that are not in the BAM file. The output should show something like this:

7a8,9
> D74RYQN1_0189:1:2208:17093:200915#0
> D74RYQN1_0189:1:2208:17337:200823#0
8a11,12
> D74RYQN1_0189:1:2208:19283:200878#0
> D74RYQN1_0189:1:2208:19604:200869#0

That's your list of missing read names. You can parse the diff output to get the read names without leading '>' symbols, then use grep to get the actual reads from the FASTQ file if you'd like. If you want to use this approach and need help figuring out the last two parts, let me know.

ADD COMMENTlink modified 6.1 years ago by Istvan Albert ♦♦ 80k • written 7.4 years ago by Mitch Bekritsky1.1k
1

Do you think rlong's algorithm will work? It seems to me you have to have the FASTQ names sorted as well... I'd like to avoid awk and do it from within python if possible

ADD REPLYlink written 7.4 years ago by User 9996800

I do like your approach by the way, but it will have to take into account the trailing / or # signs that are used for paired end reads, since many alignment programs drop these. for example tophat will drop the /1 or /2 of the FASTQ read mate 1 and read mate 2 and only report the name that comes before the '/'. it's quite annoying

ADD REPLYlink written 7.4 years ago by User 9996800

For rlong's approach to work, you would need to sort the FASTQ headers as well. Otherwise, you'd have to store the BAM headers as a dictionary or a hash, then as you read the FASTQ file, check to see if it occurs in the dictionary. This would be extremely memory inefficient, since you'd have to hold all the BAM files in memory. Sadly, my python expertise is limited. All the parsing is definitely doable in python, but I don't know if there's an implementation of diff in python. If the output of TopHat chops off the '/' or '#', you can use sed to get rid of them. See next comment for that

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k

sed 's/[#/].*$//' FASTQ_headers.txt (or a python regular expression equivalent) would get rid of any # or / symbols in the read. Be careful, because if they occur anywhere else in the read (they shouldn't), then it will truncate from there. Also, in my previous comment, I meant that rlong's approach would have to hold all the BAM query names in memory, not the whole file.

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k

One more thing...will tophat output reads where only one paired end maps?

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k

I like this solution. You only traverse each input file once, limiting IO. Then if you UNKNOWN cares about the contents of the reads, and not just the read names, you have final traversal of the fastq to drop out the reads that match your list of those in the fastq but not the bam.

ADD REPLYlink written 7.4 years ago by Rlong340

@mitch: i am running it on each mate separately so it's not an issue

ADD REPLYlink written 7.4 years ago by User 9996800

Thanks! It comes from long hours trying to find BASH commands that do things I need without me needing to do lots of coding.

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k

the mate id would not be a problem if you run it on each end separately. if i could find a way to efficiently sort the FASTQ by entry id i think rlong's approach would work, since sorting the IDs or sorting the ids where each ends in a "/1" is the same thing.

ADD REPLYlink written 7.4 years ago by User 9996800

Great to know @google. If you were mapping paired-end reads together, there's another bash trick that would tell you if one or both reads didn't map, but it's not necessary here.

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k

The sort for reads ending in a '/1' isn't the problem, from my view. It's the string comparison with the BAM file, which (from what I've seen) won't contain the trailing '/1', and so diff will not be able to match any read headers between the FASTQ and BAM files

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k

Thanks Mitch. I tried the solution I described here: http://biostar.stackexchange.com/questions/15056/how-to-efficiently-sort-a-fastq-file-by-entry-id -- but it's pretty slow, it takes 10 minutes per fastq file. Is that expected??

ADD REPLYlink written 7.4 years ago by User 9996800

Unfortunately, yes; for large files, sorting can take a bit of time. Depending on the sorting algorithm used and how many entries you're trying to sort, 10 minutes isn't terrible, even if it seems like it is ;)

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k

great thanks for all the comments mitch and rlong, was very helpful!

ADD REPLYlink written 7.4 years ago by User 9996800

My pleasure! Happy to help...

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k

Just when I thought it was working.. I get errors like: "sort: write failed: /tmp/sortV2AYRD: No space left on device" -- is that normal? how can it be avoided?

ADD REPLYlink written 7.4 years ago by User 9996800

Are you sorting in python or on the command line in bash? It sounds like something is out of space. Either it's your hard drive (check with df if you're on unix), or the sort is running out of RAM when it loads all the read headers into memory. If your hard drive still has lots of space, your best bet is to go with command line sort or samtools sort. When they encounter a file that's too large, they'll split the file into smaller chunks, sort the smaller chunks, then merge them together. It would be a pain to code that yourself if you're trying to get it done on a Sunday night.

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k

that error actually looks like it's when the command line sort is running out of space. How big are your header files from the bam or fastq?

ADD REPLYlink written 7.4 years ago by Damian Kao15k

i've managed to get it to work the problem now is that the downstream step that rlong suggested doesn't work. i think sort from unix sorts differently than sort -n in bamtools?

ADD REPLYlink written 7.4 years ago by User 9996800

From the samtools man page, sort -n sorts by read header, which should give the same output. When you say it doesn't work, does the file not look sorted? Or is there some other issue?

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k

The file gets sorted by sort, but it thinks that "@42EBKAAXX090828:6:73:204:1871/2" comes before "@42EBKAAXX090828:6:1:270:128/2", for example. not sure why. does unix "sort" do something funny with alphanumeric strings?

ADD REPLYlink written 7.4 years ago by User 9996800

Example:

$ cat test.fq | perl mergelines.pl | sort --stable -k1,1 | perl splitlines.pl > sorted
$ head sorted
@42EBKAAXX090828:6:100:1699:328/2
@42EBKAAXX090828:6:10:1077:1883/2

as you can see it thinks that "@42EBKAAXX090828:6:100:1699:328/2" comes before "@42EBKAAXX090828:6:10:1077:1883/2"

ADD REPLYlink written 7.4 years ago by User 9996800

It's an alphanumeric sort, so it is treating the query names as ASCII values. The sort you're seeing is because 42EBKAAXX090828:6:100:1699:328/2 has a '0' in the same position that 42EBKAAXX090828:6:10:1077:1883/2 has a ':', and ':' comes after '0'. Samtools appears to split the query name by ':' and sort, which will be different than unix sort output. Use this command: sort -t ':' -k 1,1 -k2,2n -k3,3n -k4,4n -k 5,5 (split by ':', numeric sort on fields 2,3,4, alphanumeric sort on 1 and 5) if you want to use samtools sort, or you can just use unix sort on both files.

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k

@Mitch: is your sort command equivalent to using -V to "sort"? thanks.

ADD REPLYlink written 7.4 years ago by User 9996800

I looked in my local unix sort and in samtools sort and I don't see -V anywhere. What sort binary are you using that has that option?

ADD REPLYlink written 7.4 years ago by Mitch Bekritsky1.1k
3
gravatar for Rm
7.4 years ago by
Rm7.8k
Danville, PA
Rm7.8k wrote:

follow "seqanswers" thread below: to extract unmapped fastqs using original Fastq and bam file (especially from output from tophat or other programms which does not include unmapped reads in bam file)

http://seqanswers.com/forums/showthread.php?t=6847&referrerid=2547

to extract unmapped reads from bam file (from newer versions of bwa etc...) which includes unmapped reads follow this:

http://www.novocraft.com/wiki/tiki-index.php?page=Extracting+unmapped+reads+from+a+BAM+file+produced+by+novoalign

ADD COMMENTlink written 7.4 years ago by Rm7.8k

it seems like the perl script in the first forum link would be inefficient memory wise, since it has to load all the IDs into memory - no?

ADD REPLYlink written 7.4 years ago by User 9996800

I dont think IDs list will eat up much of the memory...I have used the script it runs pretty fast...

ADD REPLYlink written 7.4 years ago by Rm7.8k
2
gravatar for Rlong
7.4 years ago by
Rlong340
US
Rlong340 wrote:

One approach would be to sort your bam by read name:

samtools sort -n your.bam

[EDIT: thanks to unknown] The fastq will also need to be sorted by read_names for this to work.

Open a samtools view of the bam in a pipe, and step through the fastq file record by record. If the read-name of your current record sorts to before the current read_name in your samtools view pipe, then that fastq line is not in your bam. If they match, you have a read that exists in both files. If the the read_name from your fastq file sorts to after the current read_name in your samtools view pipe, you have reads in your bam and not in your fastq.

ADD COMMENTlink modified 7.4 years ago • written 7.4 years ago by Rlong340
1

Doesn't the FASTQ have to be sorted too for this to work?

ADD REPLYlink written 7.4 years ago by User 9996800

What is the computational complexity of the algorithm samtools sort -n uses?

ADD REPLYlink written 7.4 years ago by User 9996800

Weird, I had to go to the source to find this. Not sure if I just didn't hit right search terms or what. Samtools sort looks like it leans on Heng Li's implementation of mergesort, his ks_mergesort, located in ksort.h.

ADD REPLYlink written 7.4 years ago by Rlong340

How can I sort the FASTQ file by ID? I'd prefer to use Python... can BioPython do this??

ADD REPLYlink written 7.4 years ago by User 9996800

I still think this approach doesn't work. If you have repetitions in the BAM file, how could it work?

ADD REPLYlink written 7.4 years ago by User 9996800

also, if sorting with unix command sort, do i have to use sort -n?

ADD REPLYlink written 7.4 years ago by User 9996800

I'm not sure about unix sort. Indeed it would have to mimic the sorting of samtools sort. Regarding the repetitions in the bam, that shouldn't matter. Let's say you have a read name from your fastq, for simplicity we'll say it is "aab". So if you have two "aab" reads in your bam, which you certainly will if this is paired end data, then when you hit "aab" in your fastq, three things can happen. If the last read name from the bam is "aaa", grab another read from the bam pipe. If this is also "aaa", draw another. If you get "aac", there are no "aab" reads in the bam. --- continued ---

ADD REPLYlink written 7.4 years ago by Rlong340

if you get "aab" from the bam, you grab another read from the fastq. Suppose it is "aac". Then you will compare "aab" with "aac" and draw another read from the bam pipe. If the next read is "aab" as well, no problem, you'll keep drawing them until you get "aac" or something beyond it.

ADD REPLYlink written 7.4 years ago by Rlong340

I can't get unix "sort" to mimic the samtools sort -n.... any ideas on this?

ADD REPLYlink written 7.4 years ago by User 9996800

when do you advance to next read?

ADD REPLYlink written 7.4 years ago by User 9996800
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1136 users visited in the last hour