Question: Sam To Bam Conversion Reference Fasta File
1
gravatar for Atom Smasher
7.0 years ago by
Atom Smasher20
Atom Smasher20 wrote:

Hello,

I had some data in "eland_export.txt" format and I want to convert it to the SAM format first and then later to the BAM format.

To convert "eland_export.txt" to SAM format, I used SAMTools' "export2sam.pl" script which did give me a SAM file but without SQ headers.

When I tried to convert this SAM file to BAM format using the command :

samtools view -bS infile.sam > outfile.bam

it generated the following error :

[samopen] no @SQ lines in the header.
[sam_read1] missing header? Abort!

This error was obviously due to the fact that there were no header SQ lines in my SAM file. However, I learnt that I could still convert SAM to BAM using the -t option.

However this -t option requires a reference fasta file. Could someone let me know from where I can get this fasta file ?

I am working with human genomic data from the Mar 2006 build. Is there one single "reference fasta" file for the human genome ? Or does this reference fasta file depend on the data that I am working on in any way ?

Thank you.

bam sam • 14k views
ADD COMMENTlink written 7.0 years ago by Atom Smasher20
5
gravatar for Dave Richardson
7.0 years ago by
Cambridge, UK
Dave Richardson370 wrote:

There are a number of sources for reference fasta files, such as UCSC or GRC, but you should use the same reference that your reads were aligned to.

The -t option in samtools view needs a text file describing the reference sequences, rather than the fasta file itself. You can produce this with the faidx command in samtools:

samtools faidx all_sequences.fa

then use the output

samtools view -bS -t all_sequences.fai infile.sam > outfile.bam
ADD COMMENTlink written 7.0 years ago by Dave Richardson370

Hello Dave. Yes I had not produced the index file (.fai) using faidx earlier. However, even after producing this file and running samtools, I get some error : n

[sam_read1] reference 'hs_ref_chrY' is recognized as ''. [sam_read1] reference 'hs_ref_chr10' is recognized as ''. [sam_read1] reference 'hs_ref_chr20' is recognized as ''. [sam_read1] reference 'hs_ref_chr14' is recognized as ''. [sam_read1] reference 'hs_ref_chr6' is recognized as '*'.

ADD REPLYlink written 7.0 years ago by Atom Smasher20
2
gravatar for Chris Miller
7.0 years ago by
Chris Miller20k
Washington University in St. Louis, MO
Chris Miller20k wrote:

UCSC hosts copies of the NCBI reference files. Here's the directory with hg18 (2006/build 36): http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/

I believe the file you're looking for is chromFa.zip

It comes packaged as one chromosome per file, but you'll want to cat them all together into one big file and use that as the input to samtools:

cat *.fa >all_sequences.fa
ADD COMMENTlink written 7.0 years ago by Chris Miller20k

Thanks a ton ! :)

ADD REPLYlink written 7.0 years ago by Atom Smasher20

Thanks ! :)

However after concatenating, when I tried to use the "all_sequences.fa" file as input to Samtools for BAM conversion, it went into an infinite loop printing:

[sam_header_read2] duplicated sequence name: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Trying to figure out what this is.

ADD REPLYlink written 7.0 years ago by Atom Smasher20

See Dave's comment below for more info

ADD REPLYlink written 7.0 years ago by Chris Miller20k
1
gravatar for Steve Lianoglou
7.0 years ago by
Steve Lianoglou5.0k
US
Steve Lianoglou5.0k wrote:

You can also just add your own header lines to the file. You just need to get the length of the chromosomes in your alignments and put them in the right format, eg. for human:

@SQ    SN:chr1    LN:249250621
@SQ    SN:chr2    LN:243199373
...

There are several ways to get chromosome lengths for your organism. If you tell us which organism you are using, perhaps someone can provide them for you as well.

ADD COMMENTlink written 7.0 years ago by Steve Lianoglou5.0k
1

Check this file out: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/chromInfo.txt.gz to make your SAM header. Replace hg18 to other UCSC dbname if you are working on other species.

ADD REPLYlink written 7.0 years ago by Tao Liu10

Hello Steve. I am using the Human Genome and the March 2006 build i.e hg18.

ADD REPLYlink written 7.0 years ago by Atom Smasher20
0
gravatar for Atom Smasher
7.0 years ago by
Atom Smasher20
Atom Smasher20 wrote:

Thanks ! :) However after concatenating, when I tried to use the "all_sequences.fa" file as input to Samtools for BAM conversion, it went into an infinite loop printing:

[sam_header_read2] duplicated sequence name: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Trying to figure out what this

ADD COMMENTlink written 7.0 years ago by Atom Smasher20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1220 users visited in the last hour