To get more sequences, just add RefSeq accession numbers, comma-separated, to the list. This works until the URL reaches about 2,000 characters in length. If you have more sequences than that, break your list of IDs into comma-separated blocks of just under 2,000 characters and iterate over the blocks.
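For what it's worth, the blocking step could be sketched roughly like this. The function name `chunk_ids` and the default limit of 1800 characters are my own choices for illustration (the limit just leaves some room for the rest of the URL), and the input is assumed to be a text file with one accession per line.

```shell
# chunk_ids IDS_FILE [MAX_LEN]
# Hypothetical helper: reads one accession per line and prints
# comma-separated blocks, each at most MAX_LEN characters long.
# The default of 1800 is an assumed safety margin, not an NCBI rule.
chunk_ids() {
    awk -v max="${2:-1800}" '
        NF == 0 { next }                      # skip blank lines
        {
            id = $1
            if (block == "") {
                block = id                    # start a new block
            } else if (length(block) + 1 + length(id) > max) {
                print block                   # block is full: emit it
                block = id
            } else {
                block = block "," id          # append to current block
            }
        }
        END { if (block != "") print block }  # emit the last block
    ' "$1"
}

# Possible usage, assuming the standard efetch endpoint is the URL in question:
# chunk_ids test.id.txt | while read -r block; do
#     curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&retmode=text&id=${block}"
#     sleep 1
# done > plasmids.fasta
```

Each output line is one block, so the loop issues one request per block. A single ID longer than the limit would still be printed on its own line rather than dropped.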
I was thinking that if I first download all the sequences from the FTP site and then extract the required FASTA sequences using the RefSeq ID list in the txt file, that should work as well. I just tried this but got stuck.
cut -c 1- test.id.txt | xargs -n 1 samtools faidx large.dowloaded.plasmid.fasta
(This requires the entire FASTA header as the ID in the test.id.txt file, e.g. gi|386858858|ref|NC_017775.1| Borrelia crocidurae str. Achema plasmid unnamed, complete sequence.)
So I need to match the RefSeq IDs (e.g. NC_017775.1) within the FASTA header instead of matching the entire header. Any suggestions?
To use tabix, the files would have to be indexed with tabix on the NCBI side, and they are not. Search Biostars for how to 'grep' a FASTA file by sequence name.
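One way to do that kind of grep-by-name is a small awk filter that splits the first word of each header on `|` and keeps a record if any field matches an ID in the list. This is only a sketch: the function name `extract_by_accession` is made up, and it assumes the ID file has one accession per line, written exactly as in the header (including the version suffix, e.g. NC_017775.1).

```shell
# extract_by_accession IDS_FILE FASTA_FILE
# Hypothetical helper: prints every FASTA record whose header contains
# one of the listed accessions as a "|"-separated field.
extract_by_accession() {
    awk '
        NR == FNR {                       # first file: the accession list
            if (NF) want[$1] = 1
            next
        }
        /^>/ {                            # header line of a FASTA record
            keep = 0
            header = substr($1, 2)        # first word, without the ">"
            n = split(header, part, "[|]")
            for (i = 1; i <= n; i++)
                if (part[i] in want) keep = 1
        }
        keep                              # print lines of wanted records
    ' "$1" "$2"
}

# Possible usage with the file names from the question:
# extract_by_accession test.id.txt large.dowloaded.plasmid.fasta > subset.fasta
```

Because the match is on whole fields, NC_017775.1 will not accidentally pick up something like NC_017775.10. Plain headers such as `>NC_017775.1 description` should also work, since the first word is then a single field.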