Alignment with STAR produces unique bam files, but identical output from bedtools multicov
1
0
Entering edit mode
8.0 years ago
dec986 ▴ 370

Hello,

I have a strange technical problem where many different (and unique) alignment files from STAR somehow get identical output from bedtools multicov. I used the same procedure with Tophat and it worked fine:

I use the following command in STAR:

STAR --outSAMattributes All --runThreadN 8 --genomeDir /home/con/GENE_DATA/mm10 --readFilesIn $fastq1 $fastq2

the aligned bam files are different, sorted and indexed by samtools:

con@ubuntu:~/Embryo$ ll Epiblast/Male/HFD/?/4/sorted.bam
-rw-rwxr-- 1 con david 384127141 May  4 04:53 Epiblast/Male/HFD/1/4/sorted.bam\
-rw-rwxr-- 1 con david 333413416 May  4 00:43 Epiblast/Male/HFD/2/4/sorted.bam
-rw-rwxr-- 1 con david 575693742 May  4 02:28 Epiblast/Male/HFD/3/4/sorted.bam
-rw-rwxr-- 1 con david 464273054 May  4 05:57 Epiblast/Male/HFD/4/4/sorted.bam

and processing the aligned bam files from bedtools:

con@ubuntu:~/Embryo$ bedtools multicov -split -D -bams Epiblast/Male/HFD/1/4/sorted.bam -bed ~/GENE_DATA/mm10/mm10.bed > tmp1
con@ubuntu:~/Embryo$ bedtools multicov -split -D -bams Epiblast/Male/HFD/2/4/sorted.bam -bed ~/GENE_DATA/mm10/mm10.bed > tmp2
con@ubuntu:~/Embryo$ bedtools multicov -split -D -bams Epiblast/Male/HFD/3/4/sorted.bam -bed ~/GENE_DATA/mm10/mm10.bed > tmp3
con@ubuntu:~/Embryo$ bedtools multicov -split -D -bams Epiblast/Male/HFD/4/4/sorted.bam -bed ~/GENE_DATA/mm10/mm10.bed > tmp4

gives four identical files:

con@ubuntu:~/Embryo$ ll tmp?
-rw-r--r-- 1 con david 5270039 May  5 08:18 tmp1
-rw-r--r-- 1 con david 5270039 May  5 08:18 tmp2
-rw-r--r-- 1 con david 5270039 May  5 08:18 tmp3
-rw-r--r-- 1 con david 5270039 May  5 08:18 tmp4
con@ubuntu:~/Embryo$ diff tmp1 tmp2
con@ubuntu:~/Embryo$ diff tmp1 tmp3

con@ubuntu:~/Embryo$ diff tmp1 tmp4

Why is the output identical? How can I fix this?

bedtools STAR samtools • 3.1k views
ADD COMMENT
0
Entering edit mode

I guess they are identical in size because they originate from the same bed file, but the numbers/counts are different?

ADD REPLY
0
Entering edit mode

is it because of the format of the bedfile?

what is the correct format of mm10.bed?

ADD REPLY
0
Entering edit mode

Could you check that files are different in content, but equal in size?

You could just check using less or head or tail or use

diff file1 file2
ADD REPLY
0
Entering edit mode

hi WouterDeCoster,

I did that, the files are exactly equal:

con@ubuntu:~/Embryo$ diff tmp1 tmp2 con@ubuntu:~/Embryo$ diff tmp1 tmp3 con@ubuntu:~/Embryo$ diff tmp1 tmp4

ADD REPLY
0
Entering edit mode

Every single gene is showing a count = 0, if this helps.

Why would bedtools multicov show a count of 0 for everything?

ADD REPLY
1
Entering edit mode

Common problem is that your bed file has 'chr' notation, while your bam file (and genome index) has integer notation, or vice versa. Could you check that?

ADD REPLY
0
Entering edit mode

Hi WouterDeCoster,

my genome reference file looks like this:

con@ubuntu:/mnt/data/con/Embryo/Epiblast/Female/HFD/4/1$ head /home/con/GENE_DATA/mm10/mm10.bed
chr1    3214481 3671498 NM_001011874    0   -   3216021 3671348 0   3   2487,200,947,   0,207220,456070,
chr1    4290845 4409241 NM_001195662    0   -   4292980 4409187 0   4   2167,172,636,72,    0,61064,61356,118324,
chr1    4343506 4360314 NM_011283   0   -   4344599 4352825 0   4   6585,172,636,115,   0,8403,8695,16693,
chr1    4490927 4497354 NM_001289466    0   -   4491715 4495155 0   4   1741,92,63,1064,    0,2844,4208,5363,
chr1    4490927 4497354 NM_011441   0   -   4491715 4493406 0   5   1741,367,92,807,1064,   0,2172,2844,4208,5363,
chr1    4490927 4497354 NM_001289464    0   -   4491715 4493406 0   4   1741,391,92,1064,   0,2172,2844,5363,
chr1    4490927 4497354 NM_001289465    0   -   4491715 4495155 0   4   1741,92,807,1064,   0,2844,4208,5363,

and the header of the bam file sorted.bam is

con@ubuntu:/mnt/data/con/Embryo/Epiblast/Female/HFD/4/1$ samtools view -H sorted.bam
@HD VN:1.4  SO:coordinate
@SQ SN:1    LN:195471971
@SQ SN:10   LN:130694993
@SQ SN:11   LN:122082543
@SQ SN:12   LN:120129022
@SQ SN:13   LN:120421639
@SQ SN:14   LN:124902244
@SQ SN:15   LN:104043685
@SQ SN:16   LN:98207768
@SQ SN:17   LN:94987271
@SQ SN:18   LN:90702639
@SQ SN:19   LN:61431566
@SQ SN:2    LN:182113224
@SQ SN:3    LN:160039680
@SQ SN:4    LN:156508116
@SQ SN:5    LN:151834684
@SQ SN:6    LN:149736546
@SQ SN:7    LN:145441459
@SQ SN:8    LN:129401213
@SQ SN:9    LN:124595110
ADD REPLY
1
Entering edit mode

That's something we can fix easily ;-) Run the following on your gtf file:

sed 's/^chr//' yourfile.gtf > newfile.gtf

This will change the "chr1" to "1" etc.

ADD REPLY
0
Entering edit mode

I tried this, I'm still getting the empty bedtools file :(

ADD REPLY
0
Entering edit mode

Hate to ask but did you use "newfile.gtf" for the last step? Are there any errors or just an empty file?

ADD REPLY
0
Entering edit mode

Wonder if this is happening because you are not providing all your BAM's with a single multicov command.

bedtools multicov, reports the count of alignments from multiple position-sorted and indexed BAM files that overlap intervals in a BED file.

Multicov expects multiple BAM input.

Use genomecov/coverage if you need independent coverage values.

ADD REPLY
0
Entering edit mode

Hi genomax2,

on the bedtools manual page for multicov, http://bedtools.readthedocs.io/en/latest/content/tools/multicov.html

multicov works for single bam files.

Also, as you suggested, I tried the coverage option, but this produces output like

1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  332 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  333 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  334 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  335 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  336 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  337 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  338 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  339 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  340 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  341 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  342 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  343 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  344 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  345 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  346 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  347 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  348 0
1   4776763 4777552 NS500234:165:HTVJHBGXX:1:12307:8321:15868/2 255 +   4776763 4777552 0,0,0   2   38,28,  0,761,  349 0

which I can't use because it doesn't have the transcript information contained in mm10.bed.

I just want counts from bam file, I don't know what's wrong :(

ADD REPLY
3
Entering edit mode

You want counts for what (genes or the features in your BED file)? You could use featureCounts (use a GTF file for MM10 if you want gene level counts or modify your BED file to a minimal GTF/SAF format) to get the counts.

ADD REPLY
1
Entering edit mode
7.9 years ago
dec986 ▴ 370

The answer here is that for some reason bedtools cannot process the bam file, but featureCounts can, this is the solution given by genomax2.

ADD COMMENT

Login before adding your answer.

Traffic: 2426 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6