Entering edit mode
4.2 years ago
Anu • 0
I have NGS data of a sample ( .bam files) and I want to get a full FASTA seqeunces for my analysis. I have converted .bam files in to fasta using samtools and the result is like this.
>FCC332NACXX:8:1202:5499:10405#/1 TCACCCTAACCCCTAACCCCTAACCCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTTACCCGAACCCGAACCCGAACCCGAACCCTAGCCCTA >FCC332NACXX:8:2209:15007:71606#/2 TACACTCTTTCCCCAACCCCAACCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCCTAAACCCTAAACCCTAACACTAACC >FCC332NACXX:8:1116:16641:80184#/2 GAACCCGAACCCGAACCCGAACCCGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCGAACCCGAACCCGAACCCGAACCCTAACCCGAAC >FCC332NACXX:8:1305:3227:74171#/1 CGAACCCGAACCCGAACCCTAACCCGAACCCGAACCCTAACCCGAACCCGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTCACCCGAA >FCC332NACXX:8:2210:8091:41229#/2 ACCCTCACCCTCACCCTCACCCTCACCCTAACCCCTAACCCCTAACCCCTAACCCTAACCCTACCCTAACCCTAACCCTAACCCTTACCCGAACCCGAAC
I want to get the full sequences similar to the one below.
>chr1:43585222-43586222 GGAACTGCGTTCCTTTGGAGGAGGAGAGGCGCTCTGCGTTTTAGAGTTTC CAGTTTTTCTGTTCTGTTTTTTCCCCATCTTTGTGGTTTTATCAACTTTT GGTCTTTGATGATGGTGATGTACAGATGGGTTTTTGGTGTGGATGTCCTT >chr7:47247166-47248166 AAATAGCCGTGTTCAGGCCACATGGGATAGCAGGAAATGACGCCTTTTTT TTTTTTTTTTTTTTTTTTTTTTTGAGACGGTGTCTCGCTCTGTTGCCCAG GCTGGAGTGCAGTGGTACAATCTCAGCTCACGGCAAGCTCCACCTCCCGG
How can I get that? I am using linux/ubuntu. Thank you!
This looks like you want to create a consensus sequence.
For this you first have to call the variants. Afterwards you can include this variants into the reference sequence.
Have a look at the examples in the manual by bcftools consensus for further instructions.
It is not clear if OP wants a consensus or just retrieval of original segment of sequence from the reference which contains inverted repeats.
What was the point of doing this? And how is this related to the example sequences you show in part 2? Is that the genomic region where each of your read aligns?
i wanted to search for a specific sequence in the data. Ex: lets say i want to search for inverted repeats in the sequnce, for that kind of things i think we need the fasta format
Can you comment on how the two examples (part I and II) are related logically? We know you want to get from part I to part II but it is unclear what relationship they have.
part 1 is the fasta format i get after i convert the .bam files to fasta.. when i run my code to find the " inverted repeats" in that format i don't get the desired output. Thats i want to get the my part1 as part 2. My codes work for part 2 format. Therefore, i want to convert my part 1 similar to part 2 ( like removing all the headers in part 1 and organize the sequences under their chromosome) .
By doing that you are just converting original reads to fasta format. You are losing alignment information (if it is present in your BAM file) in the process. My assumption is that
>chr1:43585222-43586222can only be gleaned from the actual alignments, which would be required to get to ⬇
Can i keep the alignment information and still get the fasta format? i have original fastq and .bam files. Is there a way to get the fasta format like in part 2?
What you did by "converting" the bam to fasta is just getting the original reads back. A bam file just contains the reads mapped to their location. Your conversion was merely an extraction of those reads. You did not get any new information.
In contrast, what you would have to do is a (de novo) assembly of those reads. You don't want to compare them to a reference genome read per read; you want to stitch them together to form a larger contig: the longer fasta records that you are looking for. This is however not a trivial procedure.
Try this: https://www.ebi.ac.uk/Tools/sfc/emboss_seqret/