Question

Off topic: Extract headers of a file using a second file containing a list of IDs


5.7 years ago

paraskevopou ▴ 20

Dear all, sorry if my question is trivial, but I am very new to the bioinformatics field. I have a text file containing the headers of a FASTA file (file1, 14000 headers) and a second text file, file2, with the IDs I want to extract. My problem is that file2 contains only the TRINITY ID (without the rest of the header information and without the leading >) and has fewer entries than file1. So my question is: how can I extract the full header (everything that comes after the >) from file1, but only for those entries that are present in file2?

file1 (the full file can be found here: https://www.dropbox.com/sh/dt09ij88052epr9/AAAB9A1k20dHs6Ktc-pSEt6qa?dl=0)

>TRINITY_DN10008_c0_g1.p1 GENE.TRINITY_DN10008_c0_g1~~TRINITY_DN10008_c0_g1.p1  ORF type:complete len:404 (+),score=64.53,WDFY2_HUMAN|46.898|2.25e-148 TRINITY_DN10008_c0_g1:212-1423(+)
>TRINITY_DN10008_c0_g2.p1 GENE.TRINITY_DN10008_c0_g2~~TRINITY_DN10008_c0_g2.p1  ORF type:5prime_partial len:359 (+),score=54.01,WDFY2_HUMAN|48.045|4.13e-137 TRINITY_DN10008_c0_g2:3-1079(+)
>TRINITY_DN10009_c0_g1.p1 GENE.TRINITY_DN10009_c0_g1~~TRINITY_DN10009_c0_g1.p1  ORF type:complete len:996 (+),score=231.51,EXOC4_HUMAN|27.089|1.79e-91 TRINITY_DN10009_c0_g1:26-3013(+)
>TRINITY_DN1000_c0_g1.p1 GENE.TRINITY_DN1000_c0_g1~~TRINITY_DN1000_c0_g1.p1  ORF type:5prime_partial len:185 (+),score=16.01,ASI4B_DANRE|30.657|1.37e-12 TRINITY_DN1000_c0_g1:1-555(+)

file2 (the full file can be found here: https://www.dropbox.com/sh/dt09ij88052epr9/AAAB9A1k20dHs6Ktc-pSEt6qa?dl=0)

TRINITY_DN10008_c0_g1.p1 
TRINITY_DN10008_c0_g2.p1
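
To make the intended result concrete, here is a minimal Python sketch of the matching described above. It is only one possible approach, not a definitive solution. The function names (`load_ids`, `matching_headers`) and the command-line usage are hypothetical, and it assumes that the ID is always the first whitespace-delimited token after the > in file1, as in the sample lines shown. IDs are compared exactly, so an ID in file2 cannot accidentally match a longer ID that merely starts with the same characters, and trailing spaces in file2 are ignored.

```python
import sys


def load_ids(lines):
    """Collect the wanted IDs, ignoring blank lines and surrounding whitespace."""
    return {line.strip() for line in lines if line.strip()}


def matching_headers(header_lines, wanted_ids):
    """Yield each header (without the leading '>') whose first
    whitespace-delimited token is one of the wanted IDs."""
    for line in header_lines:
        header = line.strip()
        if header.startswith(">"):
            header = header[1:]  # drop the FASTA '>' marker
        parts = header.split(None, 1)  # [ID, rest of header]
        if parts and parts[0] in wanted_ids:
            yield header


# Assumed invocation: python extract_headers.py file1.txt file2.txt
# (the script name and argument order are illustrative only)
if __name__ == "__main__" and len(sys.argv) == 3:
    headers_path, ids_path = sys.argv[1], sys.argv[2]
    with open(ids_path) as ids_fh:
        wanted = load_ids(ids_fh)
    with open(headers_path) as headers_fh:
        for header in matching_headers(headers_fh, wanted):
            print(header)
```

With the two sample IDs above, this would print the first two header lines of the file1 excerpt (minus the >) and skip the TRINITY_DN10009 and TRINITY_DN1000 entries.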

Thanks a lot in advance for any help!

RNA-Seq • 698 views
