Question: Faster way to get sequences from large fasta files?
0
gravatar for emilywynn6
12 months ago by
emilywynn610
emilywynn610 wrote:

Hello,

I am working with Illumina paired end reads. I am trying to use Blast to find read pairs in which one pair aligns to the nucleus and the other pair aligns to the mitochondria. Any reference genome I could align the reads to would not have the gene transfers I'm looking for, so the unaligned fastq files are my best bet.

Basically what I've done so far is turn the fastq files into fasta files and blasted them to the mitochondrial genome, then I've taken the read names from the blast output and used sed to change the ".1's" to ".2's" so that I have the name of that read's pair.

So I have a text file with 100000 read names, and I need to get the sequences of these reads. Originally I used grep to find the names and print the next line (grep -f -A1) however, because my files are so large, this took too much RAM and I had to cancel the task.

I found that using: while read line; do grep -wF -A1 "$line" query.file; done < subject.file gives me the output I need and doesn't crash our server, but it is also EXTREMELY slow.

Any advice on how to more efficiently perform this task? Even using grep -m 1 will take over a day to finish each sample.

paired-ends grep fasta • 368 views
ADD COMMENTlink modified 12 months ago by mike-zx140 • written 12 months ago by emilywynn610
1

Wouldn't STAR-Fusion or Tophat-Fusion give you the insight you need?

With these tools, you'll detect chimeric reads and search for those with one read in the nucleus and the other in mt.

ADD REPLYlink written 12 months ago by michael.ante3.5k

I think you can use samtools faidx to index the FASTA, that'll make the access faster, although I'm not sure if any tools exist that can use the faidx to speed up their operation as well as use identifiers to get the sequence.

You could try samtools faidx file.fasta; samtools faidx file.fasta <identifier>

ADD REPLYlink modified 12 months ago • written 12 months ago by RamRS24k
3
gravatar for Pierre Lindenbaum
12 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

linearize your fasa file, sort and use joiin.

Not tested but it should be something like

join  -t $'\t' -1 1 -2 1 \
   <(cat query.file | paste - - | cut -c 2- | sort -t $'\t' -k1,1) \
   <(sort subject.file) |\
tr "\t" "\n"
ADD COMMENTlink modified 12 months ago • written 12 months ago by Pierre Lindenbaum123k

I've tried this out on a small test file, and I think this strategy will work and be much faster. Thank you!!!

ADD REPLYlink written 12 months ago by emilywynn610

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

Upvote|Bookmark|Accept

ADD REPLYlink written 12 months ago by Pierre Lindenbaum123k
1
gravatar for mike-zx
12 months ago by
mike-zx140
mike-zx140 wrote:

if you only need all the sequences without headers you could:

cat fasta | paste - - | cut -f2 > output

if you need each sequence in a separate file with the fasta header you could:

for id in $(cat file_with_headers_list)
do
  filename=$(cat test | grep "$id" | cut -d\> -f2)
  cat fasta | paste - - | grep $id | tr '\t' '\n' > ${filename}_output
done

These should be faster I think, don't know if I understood what you wanted correctly though so I apologize in advance if I misunderstood.

ADD COMMENTlink modified 12 months ago • written 12 months ago by mike-zx140

obviously to make the "file_with_headers_list" you would:

cat fasta | grep \> > file_with_headers_list
ADD REPLYlink written 12 months ago by mike-zx140
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2455 users visited in the last hour