Question: How To Get Fasta Format Using Fastafrombed Or How To Turn Linearized Fasta To The Same Length Columns
1
gravatar for PoGibas
5.7 years ago by
PoGibas4.7k
Vilnius
PoGibas4.7k wrote:

I extracted sequences with fastaFromBed and have no complains about the BEDTools which is really awesome thing.

Otherwise extracted sequences look like this:

>chr19:13985513-13985622   
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT
>chr19:13985689-13985825  
TCCCCTCCCCTAGGCCACAGCCGAGGTCACAATCAACATTCATTGTTGTCGGTGGGTTGTGAGGACTGAGGCCAGACCCACCGGGGGATGAATGTCACTGTGGCTGGGCCAGACACG

And my input file looks like this:

>chr19
agtcccagctactcgggaggctaaggcaggagaatcgcttgaacccagga
ggtggaggttgcagggagccgagatcgcaccactgcactccagcctgggc
gacagagcgagattccgtctcaaaaagtaaaataaaataaaataaaaaat
aaaagtttgatatattcagaatcagggaggtctgctgggtgcagttcatt
tgaaaaattcctcagcattttagtGATCTGTATGGTCCCTCtatctgtca
gggtcctagcaggaaattgttgcactctcaaaggattaagcagaaagagt

I was using this:

fastaFromBed -fi input -bed seq.bed -fo output

So shouldn't those sequences be formed in FASTA format (as ncbi says "It is recommended that all lines of text be shorter than 80 characters in length") or at least the same line length as my input file?

What I am doing wrong that I am getting linearized (fasta?) output with fastaFromBed?
What is the quickest way to turn those linear sequences to nicely formatted columns using command line?

fasta bedtools • 4.7k views
ADD COMMENTlink modified 5.7 years ago by lh331k • written 5.7 years ago by PoGibas4.7k

That is a terrible recommendation - but then NCBI is the King of Terrible Formats. I recommend NOT wrapping sequence data. 1) With the massive amount of sequences we have these days, the "human readable" argument don't hold up. 2) With wrapped sequences grep fails at the line ends. 3) You cant easily random access wrapped FASTA a files, 4) It's waste of precious newlines.

ADD REPLYlink written 5.7 years ago by Martin A Hansen3.0k
1

I disagree. I frequently "less" reads and genomic sequences/subsequences to get a feeling about lengths, the repetitiveness, lowercase/uppercase patterns, etc. Sequences should be human readable no matter how long it is. We should not trust our programs too much. Human eyes are frequently more effective in identifying problems. As to your other concerns: 2) with sequences in one line, grep will return the entire sequence, which is frequently pointless. 3) as long as sequence lines are of the same length, you can use bioperl::DB::fasta or faidx strategy for quick random access. There is only a little more work. 4) I am not sure why newlines are precious.

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by lh331k

I totally agree with you and prefer my sequences store in linear mode, but it was just a xxx database thing.

ADD REPLYlink modified 5.7 years ago • written 5.7 years ago by PoGibas4.7k
4
gravatar for Neilfws
5.7 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

The NCBI recommendation is that sequence lines in FASTA be 80 characters or less, but it is not enforced. Some software outputs sequence all on one line. Most, if not all, software that reads FASTA is able to cope with this, without errors.

There are lots of ways that you could reformat. One is to use seqret from the EMBOSS package, to simply read in the FASTA formatted sequences and write them out again in tidy FASTA format:

seqret -sequence myfasta.fa -outseq mytidyfasta.fa
ADD COMMENTlink written 5.7 years ago by Neilfws48k
4
gravatar for lh3
5.7 years ago by
lh331k
United States
lh331k wrote:

Another option is seqtk:

seqtk seq -l 80 in.fa > out.fa

It is faster than seqret. Seqtk also implements some bedtools functionality:

seqtk subseq -l 60 in.fa reg.bed > out.fa  # fastaFromBed with the support of fastq and customized line length
seqtk seq -M reg.bed in.fa > out.fa        # maskFastaFromBed
ADD COMMENTlink written 5.7 years ago by lh331k
1
gravatar for Pierre Lindenbaum
5.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum112k wrote:

You can pipe the ouput into 'fold -w 80'

$ echo -e ">chr19:13985513-13985622\nGGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAGGCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT" | fold -w 60
>chr19:13985513-13985622
GGAAAATTTGCCAAGGGTTTGGGGGAACATTCAACCTGTCGGTGAGTTTGGGCAGCTCAG
GCAAACCATCGACCGTTGAGTGGACCCTGAGGCCTGGAATTGCCATCCT
ADD COMMENTlink written 5.7 years ago by Pierre Lindenbaum112k

Assuming that the header lines are < 60 characters, of course :)

ADD REPLYlink written 5.7 years ago by Neilfws48k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1883 users visited in the last hour