Sam To Bam Conversion Reference Fasta File
4
1
Entering edit mode
12.1 years ago
Atom Smasher ▴ 20

Hello,

I had some data in "eland_export.txt" format and I want to convert it to the SAM format first and then later to the BAM format.

To convert "eland_export.txt" to SAM format, I used SAMTools' "export2sam.pl" script which did give me a SAM file but without SQ headers.

When I tried to convert this SAM file to BAM format using the command :

samtools view -bS infile.sam > outfile.bam

it generated the following error :

[samopen] no @SQ lines in the header.
[sam_read1] missing header? Abort!

This error was obviously due to the fact that there were no header SQ lines in my SAM file. However, I learnt that I could still convert SAM to BAM using the -t option.

However this -t option requires a reference fasta file. Could someone let me know from where I can get this fasta file ?

I am working with human genomic data from the Mar 2006 build. Is there one single "reference fasta" file for the human genome ? Or does this reference fasta file depend on the data that I am working on in any way ?

Thank you.

bam sam • 17k views
ADD COMMENT
5
Entering edit mode
12.1 years ago

There are a number of sources for reference fasta files, such as UCSC or GRC, but you should use the same reference that your reads were aligned to.

The -t option in samtools view needs a text file describing the reference sequences, rather than the fasta file itself. You can produce this with the faidx command in samtools:

samtools faidx all_sequences.fa

then use the output

samtools view -bS -t all_sequences.fai infile.sam > outfile.bam
ADD COMMENT
0
Entering edit mode

Hello Dave. Yes I had not produced the index file (.fai) using faidx earlier. However, even after producing this file and running samtools, I get some error:

[sam_read1] reference 'hs_ref_chrY' is recognized as '*'.
[sam_read1] reference 'hs_ref_chr10' is recognized as '*'.
[sam_read1] reference 'hs_ref_chr20' is recognized as '*'.
[sam_read1] reference 'hs_ref_chr14' is recognized as '*'.
[sam_read1] reference 'hs_ref_chr6' is recognized as '*'
ADD REPLY
2
Entering edit mode
12.1 years ago

UCSC hosts copies of the NCBI reference files. Here's the directory with hg18 (2006/build 36): http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/

I believe the file you're looking for is chromFa.zip

It comes packaged as one chromosome per file, but you'll want to cat them all together into one big file and use that as the input to samtools:

cat *.fa >all_sequences.fa
ADD COMMENT
0
Entering edit mode

Thanks a ton ! :)

ADD REPLY
0
Entering edit mode

Thanks ! :)

However after concatenating, when I tried to use the all_sequences.fa file as input to Samtools for BAM conversion, it went into an infinite loop printing:

[sam_header_read2] duplicated sequence name: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Trying to figure out what this is.

ADD REPLY
0
Entering edit mode

See Dave's comment below for more info

ADD REPLY
1
Entering edit mode
12.1 years ago

You can also just add your own header lines to the file. You just need to get the length of the chromosomes in your alignments and put them in the right format, eg. for human:

@SQ    SN:chr1    LN:249250621
@SQ    SN:chr2    LN:243199373
...

There are several ways to get chromosome lengths for your organism. If you tell us which organism you are using, perhaps someone can provide them for you as well.

ADD COMMENT
1
Entering edit mode

Check this file out: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/chromInfo.txt.gz to make your SAM header. Replace hg18 to other UCSC dbname if you are working on other species.

ADD REPLY
0
Entering edit mode

Hello Steve. I am using the Human Genome and the March 2006 build i.e hg18.

ADD REPLY
0
Entering edit mode
12.1 years ago
Atom Smasher ▴ 20

Thanks ! :) However after concatenating, when I tried to use the "all_sequences.fa" file as input to Samtools for BAM conversion, it went into an infinite loop printing:

[sam_header_read2] duplicated sequence name: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Trying to figure out what this

ADD COMMENT

Login before adding your answer.

Traffic: 2022 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6