ncbi-datasets-cli-高效便捷下载NCBI数据

news2024/11/25 14:48:57

文章目录

  • 简介
  • 安装
  • ```datasets download```下载基因组/基因序列
    • 按照GCA list文件编号下载
    • 下载大基因组
    • genome完整参数
    • gene参数
  • ```datasets summary```下载元数据
  • ```dataformat```将json转换成表格格式
  • 通过json文件解析其他字段
  • 问题

简介

NCBI Datasets 可以轻松从 NCBI 数据库中收集数据。使用命令行界面(CLI)工具或 NCBI Datasets 网页界面查找和下载基因和基因组的序列、注释和元数据。如下是可用的工具:
在这里插入图片描述

安装

  • 使用conda安装Datasets CLI tools, datasetsand dataformat:
# 注意不是datasets而是ncbi-datasets-cli
$ conda install -c conda-forge ncbi-datasets-cli
(base) [yut@io02 ~]$ datasets --version
datasets version: 15.25.0

datasets download下载基因组/基因序列

datasets从 NCBI 下载所有生命领域的生物序列数据,dataformat将前者下载的数据包中的元数据从 JSON Lines 格式转换为其他格式。

使用datasets下载人类参考基因组 GRCh38 的基因组数据包:

$ datasets download genome taxon human --reference --filename human-reference.zip

使用 dataformat从下载的人类参考基因组 GRCh38 数据包中提取选定的元数据字段:

$ dataformat tsv genome --package human-reference.zip --fields organism-name,assminfo-name,accession,assminfo-submitter
Organism name	Assembly Name	Assembly Accession	Assembly Submitter
Homo sapiens	GRCh38.p14	GCF_000001405.40	Genome Reference Consortium

按照GCA list文件编号下载

(base) [yut@io02 02_Glacier_new_taxa]$ head 3.gca
GCF_020042285.1
GCF_020783315.1
GCF_024343615.1
(base) [yut@io02 02_Glacier_new_taxa]$ time datasets download genome accession --inputfile 3.gca --include gff3,rna,cds,protein,genome,seq-report --filename  3genome.zip
# --inputfile:输入GCA号的list,每行一个
# --filename:输出zip包名称,默认ncbi-dataset.zip

New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Collecting 3 genome records [================================================] 100% 3/3
Downloading: 3genome.zip    10.4MB valid zip archive
Validating package files [================================================] 100% 18/18

real    0m8.208s
user    0m0.652s
sys     0m0.234s
(base) [yut@io02 02_Glacier_new_taxa]$ ls
3genome.zip  download.log 

下载大基因组

下载大量基因组,首先下载压缩包,然后分三步访问数据。

  • 1.下载人基因组压缩包
datasets download genome accession GCF_000001405.40 --dehydrated --filename human_GRCh38_dataset.zip
  • 2.解压
unzip human_GRCh38_dataset.zip -d my_human_dataset
  • 3.转换格式
datasets rehydrate --directory my_human_dataset/

genome完整参数

(base) [yut@io02 ~]$ datasets download genome  --help

Download a genome data package. Genome data packages may include genome, transcript and protein sequences, annotation and one or more data reports. Data packages are downloaded as a zip archive.

The default genome data package includes the following files:
  * <accession>_<assembly_name>_genomic.fna (genomic sequences)
  * assembly_data_report.jsonl (data report with genome assembly and annotation metadata)
  * dataset_catalog.json (a list of files and file types included in the data package)

Usage
  datasets download genome [flags]
  datasets download genome [command]

Sample Commands
  datasets download genome accession GCF_000001405.40 --chromosomes X,Y --include genome,gff3,rna
  datasets download genome taxon "bos taurus" --dehydrated
  datasets download genome taxon human --assembly-level chromosome,complete --dehydrated
  datasets download genome taxon mouse --search C57BL/6J --search "Broad Institute" --dehydrated

Available Commands
  accession   Download a genome data package by Assembly or BioProject accession
  taxon       Download a genome data package by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank)

Flags
      --annotated                Limit to annotated genomes
      --assembly-level string    Limit to genomes at one or more assembly levels (comma-separated):
                                   * chromosome
                                   * complete
                                   * contig
                                   * scaffold
                                    (default "[]")
      --assembly-source string   Limit to 'RefSeq' (GCF_) or 'GenBank' (GCA_) genomes (default "all")
      --chromosomes strings      Limit to a specified, comma-delimited list of chromosomes, or 'all' for all chromosomes
      --dehydrated               Download a dehydrated zip archive including the data report and locations of data files (use the rehydrate command to retrieve data files).
      --exclude-atypical         Exclude atypical assemblies
      --mag string               Limit to metagenome assembled genomes (only) or remove them from the results (exclude) (default "all")
      --preview                  Show information about the requested data package
      --reference                Limit to reference genomes
      --released-after string    Limit to genomes released on or after a specified date (MM/DD/YYYY)
      --released-before string   Limit to genomes released on or before a specified date (MM/DD/YYYY)
      --search strings           Limit results to genomes with specified text in the searchable fields:
                                 species and infraspecies, assembly name and submitter.
                                 To search multiple strings, use the flag multiple times.


Global Flags
      --api-key string    Specify an NCBI API key
      --debug             Emit debugging info
      --filename string   Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")
      --help              Print detailed help about a datasets command
      --no-progressbar    Hide progress bar
      --version           Print version of datasets

Use datasets download genome <command> --help for detailed help about a command.

gene参数

(base) [yut@io02 ~]$ datasets download gene --help

Download a gene data package.  Gene data packages include gene, transcript and protein sequences and one or more data reports. Data packages are downloaded as a zip archive.

The default gene data package for NM, NR, NP, XM, XR, XP and YP accessions:
  * rna.fna (transcript sequences)
  * protein.faa (protein sequences)
  * data_report.jsonl (data report with gene metadata)
  * dataset_catalog.json (a list of files and file types included in the data package)

Usage
  datasets download gene [flags]
  datasets download gene [command]

Sample Commands
  datasets download gene gene-id 672
  datasets download gene symbol brca1 --taxon mouse
  datasets download gene accession NP_000483.3
  datasets download gene gene-id 2778 --fasta-filter NC_000020.11,NM_001077490.3,NP_001070958.1

Available Commands
  gene-id     Download a gene data package by NCBI Gene ID
  symbol      Download a gene data package by gene symbol
  accession   Download a gene data package by RefSeq nucleotide or protein accession
  taxon       Download a gene data package by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank)

Flags
      --fasta-filter strings       Limit protein and RNA sequence files to the specified RefSeq nucleotide and protein accessions
      --fasta-filter-file string   Limit protein and RNA sequence files to the specified RefSeq nucleotide and protein accessions included in the specified file
      --preview                    Show information about the requested data package


Global Flags
      --api-key string    Specify an NCBI API key
      --debug             Emit debugging info
      --filename string   Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")
      --help              Print detailed help about a datasets command
      --no-progressbar    Hide progress bar
      --version           Print version of datasets

Use datasets download gene <command> --help for detailed help about a command.

datasets summary下载元数据

(base) [yut@io02 ~]$ datasets summary --help

Print a data report containing gene, genome or virus metadata in JSON format.

Usage
  datasets summary [flags]
  datasets summary [command]

Sample Commands
  datasets summary genome accession GCF_000001405.40
  datasets summary genome taxon "mus musculus"
  datasets summary gene gene-id 672
  datasets summary gene symbol brca1 --taxon mouse
  datasets summary gene accession NP_000483.3
  datasets summary virus genome accession NC_045512.2
  datasets summary virus genome taxon sars-cov-2 --host dog

Available Commands
  gene        Print a summary of a gene dataset
  genome      Print a data report containing genome metadata
  virus       Print a data report containing virus genome metadata

Global Flags
      --api-key string   Specify an NCBI API key
      --debug            Emit debugging info
      --help             Print detailed help about a datasets command
      --version          Print version of datasets

Use datasets summary <command> --help for detailed help about a command.

  • 实例
(base) [yut@io02 ~]$ datasets summary genome accession GCF_000001405.40
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
{"reports": [{"accession":"GCF_000001405.40","annotation_info":{"busco":{"busco_lineage":"primates_odb10","busco_ver":"4.1.4","complete":0.99187225,"duplicated":0.007256894,"fragmented":0.0015239477,"missing":0.0066037737,"single_copy":0.9846154,"total_count":"13780"},"method":"Best-placed RefSeq; Gnomon; RefSeqFE; cmsearch; tRNAscan-SE","name":"GCF_000001405.40-RS_2023_10","pipeline":"NCBI eukaryotic genome annotation pipeline","provider":"NCBI RefSeq","release_date":"2023-10-02","report_url":"https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Homo_sapiens/GCF_000001405.40-RS_2023_10.html","software_version":"10.2","stats":{"gene_counts":{"non_coding":22158,"other":413,"protein_coding":20080,"pseudogene":17001,"total":59652}},"status":"Updated annotation"},"assembly_info":{"assembly_level":"Chromosome","assembly_name":"GRCh38.p14","assembly_status":"current","assembly_type":"haploid-with-alt-loci","bioproject_accession":"PRJNA31257","bioproject_lineage":[{"bioprojects":[{"accession":"PRJNA31257","title":"The Human Genome Project, currently maintained by the Genome Reference Consortium (GRC)"}]}],"blast_url":"https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch\u0026PROG_DEF=blastn\u0026BLAST_SPEC=GDH_GCF_000001405.40","description":"Genome Reference Consortium Human Build 38 patch release 14 (GRCh38.p14)","paired_assembly":{"accession":"GCA_000001405.29","only_genbank":"4 unlocalized and unplaced scaffolds.","status":"current"},"refseq_category":"reference genome","release_date":"2022-02-03","submitter":"Genome Reference Consortium","synonym":"hg38"},"assembly_stats":{"contig_l50":18,"contig_n50":57879411,"gaps_between_scaffolds_count":349,"gc_count":"1374283647","gc_percent":41,"number_of_component_sequences":35611,"number_of_contigs":996,"number_of_organelles":1,"number_of_scaffolds":470,"scaffold_l50":16,"scaffold_n50":67794873,"total_number_of_chromosomes":24,"total_sequence_length":"3099441038","total_ungapped_length":"2948318359"},"current_accession":"GCF_000001405.40","organelle_info":[{"description":"Mitochondrion","submitter":"Genome Reference Consortium","total_seq_length":"16569"}],"organism":{"common_name":"human","organism_name":"Homo sapiens","tax_id":9606},"paired_accession":"GCA_000001405.29","source_database":"SOURCE_DATABASE_REFSEQ"}],"total_count": 1}

(base) [yut@io02 ~]$ datasets summary gene gene-id 672
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
{"reports": [{"gene":{"annotations":[{"annotation_name":"GCF_000001405.40-RS_2023_10","annotation_release_date":"2023-10-02","assembly_accession":"GCF_000001405.40","assembly_name":"GRCh38.p14","genomic_locations":[{"genomic_accession_version":"NC_000017.11","genomic_range":{"begin":"43044295","end":"43170327","orientation":"minus"},"sequence_name":"17"}]},{"annotation_name":"GCF_009914755.1-RS_2023_10","annotation_release_date":"2023-10-02","assembly_accession":"GCF_009914755.1","assembly_name":"T2T-CHM13v2.0","genomic_locations":[{"genomic_accession_version":"NC_060941.1","genomic_range":{"begin":"43902857","end":"44029084","orientation":"minus"},"sequence_name":"17"}]}],"chromosomes":["17"],"common_name":"human","description":"BRCA1 DNA repair associated","ensembl_gene_ids":["ENSG00000012048"],"gene_groups":[{"id":"672","method":"NCBI Ortholog"}],"gene_id":"672","nomenclature_authority":{"authority":"HGNC","identifier":"HGNC:1100"},"omim_ids":["113705"],"orientation":"minus","protein_count":368,"reference_standards":[{"gene_range":{"accession_version":"NG_005905.2","range":[{"begin":"92501","end":"173689","orientation":"plus"}]},"type":"REFSEQ_GENE"}],"swiss_prot_accessions":["P38398"],"symbol":"BRCA1","synonyms":["IRIS","PSCP","BRCAI","BRCC1","FANCS","PNCA4","RNF53","BROVCA1","PPP1R53"],"tax_id":"9606","taxname":"Homo sapiens","transcript_count":368,"transcript_type_counts":[{"count":368,"type":"PROTEIN_CODING"}],"type":"PROTEIN_CODING"},"query":["672"]}],"total_count": 1}
  • 下载结果为json格式

dataformat将json转换成表格格式

(base) [yut@io02 ~]$ dataformat tsv

Convert data to TSV format.

Refer to NCBI's [download and install](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) documentation for information about getting started with the command-line tools.

Usage
  dataformat tsv [command]

Report Commands
  genome             Convert Genome Assembly Data Report into TSV format
  genome-seq         Convert Genome Assembly Sequence Report into TSV format
  gene               Convert Gene Report into TSV format
  gene-product       Convert Gene Product Report into TSV format
  virus-genome       Convert Virus Data Report into TSV format
  virus-annotation   Convert Virus Annotation Report into TSV format
  microbigge         Convert MicroBIGG-E Data Report into TSV format
  prok-gene          Convert Prokaryote Gene Report into TSV format
  prok-gene-location Convert Prokaryote Gene Location Report into TSV format
  genome-annotations Convert Genome Annotation Report into TSV format

Flags
      --elide-header   Do not output header
  -h, --help           help for tsv



Global Flags
      --force   Force dataformat to run without type check prompt

Use dataformat tsv <command> --help for detailed help about a command.

(base) [yut@io02 ~]$ dataformat tsv gene
Error: --inputfile and/or --packagefile must be specified, or data can be read from standard input
Usage
  dataformat tsv gene [flags]

Examples
  dataformat tsv gene --inputfile gene_package/ncbi_dataset/data/data_report.jsonl
  dataformat tsv gene --package genes.zip

Flags
      --fields strings     Comma-separated list of fields (default annotation-assembly-accession,annotation-assembly-name,annotation-genomic-range-accession,annotation-genomic-range-exon-order,annotation-genomic-range-exon-orientation,annotation-genomic-range-exon-start,annotation-genomic-range-exon-stop,annotation-genomic-range-range-order,annotation-genomic-range-range-orientation,annotation-genomic-range-range-start,annotation-genomic-range-range-stop,annotation-genomic-range-seq-name,annotation-release-date,annotation-release-name,chromosomes,common-name,description,ensembl-geneids,gene-id,gene-type,genomic-region-gene-range-accession,genomic-region-gene-range-range-order,genomic-region-gene-range-range-orientation,genomic-region-gene-range-range-start,genomic-region-gene-range-range-stop,genomic-region-genomic-region-type,group-id,group-method,name-authority,name-id,omim-ids,orientation,protein-count,ref-standard-gene-range-accession,ref-standard-gene-range-range-order,ref-standard-gene-range-range-orientation,ref-standard-gene-range-range-start,ref-standard-gene-range-range-stop,ref-standard-genomic-region-type,replaced-gene-id,rna-type,swissprot-accessions,symbol,synonyms,tax-id,tax-name,transcript-count)
                               - annotation-assembly-accession
                               - annotation-assembly-name
                               - annotation-genomic-range-accession
                               - annotation-genomic-range-exon-order
                               - annotation-genomic-range-exon-orientation
                               - annotation-genomic-range-exon-start
                               - annotation-genomic-range-exon-stop
                               - annotation-genomic-range-range-order
                               - annotation-genomic-range-range-orientation
                               - annotation-genomic-range-range-start
                               - annotation-genomic-range-range-stop
                               - annotation-genomic-range-seq-name
                               - annotation-release-date
                               - annotation-release-name
                               - chromosomes
                               - common-name
                               - description
                               - ensembl-geneids
                               - gene-id
                               - gene-type
                               - genomic-region-gene-range-accession
                               - genomic-region-gene-range-range-order
                               - genomic-region-gene-range-range-orientation
                               - genomic-region-gene-range-range-start
                               - genomic-region-gene-range-range-stop
                               - genomic-region-genomic-region-type
                               - group-id
                               - group-method
                               - name-authority
                               - name-id
                               - omim-ids
                               - orientation
                               - protein-count
                               - ref-standard-gene-range-accession
                               - ref-standard-gene-range-range-order
                               - ref-standard-gene-range-range-orientation
                               - ref-standard-gene-range-range-start
                               - ref-standard-gene-range-range-stop
                               - ref-standard-genomic-region-type
                               - replaced-gene-id
                               - rna-type
                               - swissprot-accessions
                               - symbol
                               - synonyms
                               - tax-id
                               - tax-name
                               - transcript-count
  -h, --help               help for gene
      --inputfile string   Input file (default "ncbi_dataset/data/data_report.jsonl")
      --package string     Data package (zip archive), inputfile parameter is relative to the root path inside the archive



Global Flags
      --elide-header   Do not output header
      --force          Force dataformat to run without type check prompt

  • 实例
(base) [yut@io02 ~]$ datasets summary gene gene-id 672  --as-json-lines |dataformat tsv gene
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Annotation Assembly Accession   Annotation Assembly Name        Annotation Genomic Range Accession      Annotation Genomic Range Exons Order    Annotation Genomic Range Exons Orientation      Annotation Genomic Range Exons Start    Annotation Genomic Range Exons Stop     Annotation Genomic Range Order        Annotation Genomic Range Orientation    Annotation Genomic Range Start  Annotation Genomic Range Stop   Annotation Genomic Range Seq Name       Annotation Release Date Annotation Release Name Chromosomes     Common Name     Description     Ensembl GeneIDs NCBI GeneID   Gene Type       Genomic Region Gene Range Sequence Accession    Genomic Region Gene Range Order Genomic Region Gene Range Orientation   Genomic Region Gene Range Start Genomic Region Gene Range Stop  Genomic Region Genomic Region Type      Gene Group Identifier   Gene Group Method     Nomenclature Authority  Nomenclature ID OMIM IDs        Orientation     Proteins        Reference Standard Gene Range Sequence Accession        Reference Standard Gene Range Order     Reference Standard Gene Range Orientation       Reference Standard Gene Range Start     Reference Standard Gene Range Stop    Reference Standard Genomic Region Type  Replaced NCBI GeneID    RNA Type        SwissProt Accessions    Symbol  Synonyms        Taxonomic ID    Taxonomic Name  Transcripts
GCF_000001405.40        GRCh38.p14      NC_000017.11                                            minus   43044295        43170327        17      2023-10-02      GCF_000001405.40-RS_2023_10     17      human   BRCA1 DNA repair associated     ENSG00000012048 672     PROTEIN_CODING       672      NCBI Ortholog   HGNC    HGNC:1100       113705  minus   368     NG_005905.2             plus    92501   173689  REFSEQ_GENE                     P38398  BRCA1   IRIS,PSCP,BRCAI,BRCC1,FANCS,PNCA4,RNF53,BROVCA1,PPP1R53 9606    Homo sapiens    368
GCF_009914755.1 T2T-CHM13v2.0   NC_060941.1                                             minus   43902857        44029084        17      2023-10-02      GCF_009914755.1-RS_2023_10      17      human   BRCA1 DNA repair associated     ENSG00000012048 672     PROTEIN_CODING               672      NCBI Ortholog   HGNC    HGNC:1100       113705  minus   368     NG_005905.2             plus    92501   173689  REFSEQ_GENE                     P38398  BRCA1   IRIS,PSCP,BRCAI,BRCC1,FANCS,PNCA4,RNF53,BROVCA1,PPP1R53 9606    Homo sapiens    368

(base) [yut@io02 ~]$ datasets summary gene gene-id 672  --as-json-lines |dataformat tsv gene --fields gene-id,gene-type,symbol
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
NCBI GeneID     Gene Type       Symbol
672     PROTEIN_CODING  BRCA1

# --as-json-lines必须加上
# --fields指定需要的字段,多个空格隔开

通过json文件解析其他字段

  • 某些字段无法通过dataformat提取出来,可先保存成json文件,然后通过下面脚本解析:
(base) [yut@node01 ~]$ cat dataset.json
{"accession":"GCA_013141435.1","annotation_info":{"method":"Best-placed reference protein set; GeneMarkS-2+","name":"NCBI Prokaryotic Genome Annotation Pipeline (PGAP)","pipeline":"NCBI Prokaryotic Genome Annotation Pipeline (PGAP)","provider":"NCBI","release_date":"2020-05-14","software_version":"4.11","stats":{"gene_counts":{"non_coding":27,"protein_coding":2566,"pseudogene":17,"total":2610}}},"assembly_info":{"assembly_level":"Contig","assembly_method":"MetaSPAdes v. 3.10.1","assembly_name":"ASM1314143v1","assembly_status":"current","assembly_type":"haploid","bioproject_accession":"PRJNA622654","bioproject_lineage":[{"bioprojects":[{"accession":"PRJNA622654","title":"Metagenomic profiling of ammonia and methane-oxidizing microorganisms in a Dutch drinking water treatment plant"}]}],"biosample":{"accession":"SAMN14539096","attributes":[{"name":"isolation_source","value":"Primary rapid sand filter"},{"name":"collection_date","value":"not applicable"},{"name":"geo_loc_name","value":"Netherlands"},{"name":"lat_lon","value":"not applicable"},{"name":"isolate","value":"P-RSF-IL-07"},{"name":"depth","value":"not applicable"},{"name":"env_broad_scale","value":"drinking water treatment plant"},{"name":"env_local_scale","value":"Primary rapid sand filter"},{"name":"env_medium","value":"not applicable"},{"name":"metagenomic","value":"1"},{"name":"environmental-sample","value":"1"},{"name":"sample_type","value":"metagenomic assembly"},{"name":"metagenome-source","value":"drinking water metagenome"},{"name":"derived_from","value":"This BioSample is a metagenomic assembly obtained from the drinking water metagenome BioSample:SAMN14524263, SAMN14524264, SAMN14524265, SAMN14524266"}],"bioprojects":[{"accession":"PRJNA622654"}],"description":{"comment":"Keywords: GSC:MIxS;MIMAG:6.0","organism":{"organism_name":"Ferruginibacter sp.","tax_id":1940288},"title":"MIMAG Metagenome-assembled Genome sample from Ferruginibacter sp."},"last_updated":"2020-05-19T00:50:12.857","models":["MIMAG.water"],"owner":{"contacts":[{}],"name":"Radboud University"},"package":"MIMAG.water.6.0","publication_date":"2020-05-19T00:50:12.857","sample_ids":[{"label":"Sample name","value":"Ferruginibacter sp. P-RSF-IL-07"}],"status":{"status":"live","when":"2020-05-19T00:50:12.857"},"submission_date":"2020-04-04T13:17:04.950"},"comments":"The annotation was added by the NCBI Prokaryotic Genome Annotation Pipeline (PGAP). Information about PGAP can be found here: https://www.ncbi.nlm.nih.gov/genome/annotation_prok/","genome_notes":["derived from metagenome"],"release_date":"2020-05-21","sequencing_tech":"Illumina MiSeq","submitter":"Radboud University"},"assembly_stats":{"contig_l50":10,"contig_n50":104094,"gc_count":"978119","gc_percent":32,"genome_coverage":"270.4x","number_of_component_sequences":43,"number_of_contigs":43,"total_sequence_length":"3056910","total_ungapped_length":"3056910"},"average_nucleotide_identity":{"best_ani_match":{"ani":79.65,"assembly":"GCA_003426875.1","assembly_coverage":0.01,"category":"type","organism_name":"Lutibacter oceani","type_assembly_coverage":0.01},"category":"category_na","comment":"na","match_status":"low_coverage","submitted_organism":"Ferruginibacter sp.","submitted_species":"Ferruginibacter sp.","taxonomy_check_status":"Inconclusive"},"current_accession":"GCA_013141435.1","organism":{"infraspecific_names":{"isolate":"P-RSF-IL-07"},"organism_name":"Ferruginibacter sp.","tax_id":1940288},"source_database":"SOURCE_DATABASE_GENBANK","wgs_info":{"master_wgs_url":"https://www.ncbi.nlm.nih.gov/nuccore/JABFQZ000000000.1","wgs_contigs_url":"https://www.ncbi.nlm.nih.gov/Traces/wgs/JABFQZ01","wgs_project_accession":"JABFQZ01"}}

(base) [yut@node01 ~]$ Parse_dataset_genome_json_metadata.py  *json
Save result in output.csv
(base) [yut@node01 ~]$ cat output.csv
Accession,Geo Location Name,Latitude and Longitude,Collection date,Env broad scale,Env local scale,Env medium,Sample type
GCA_013141435.1,Netherlands,not applicable,not applicable,drinking water treatment plant,Primary rapid sand filter,not applicable,metagenomic assembly
(base) [yut@node01 ~]$ cat ~/Software/Important_scripts/Parse_dataset_genome_json_metadata.py
#!/usr/bin/env python
import argparse
import json
import pandas as pd

# 创建参数解析器
parser = argparse.ArgumentParser(description='Parse JSON data')
parser.add_argument('json_file', help='Path to the JSON file')

# 解析参数
args = parser.parse_args()

# 读取JSON文件
with open(args.json_file, 'r') as file:
    json_str = file.read()

# 解析JSON
data = json.loads(json_str)

# 获取env_broad_scale字段的值
# 获取所需字段的值
accession = data["accession"]
geo_loc_name = data["assembly_info"]["biosample"]["attributes"][2]["value"]
lat_lon = data["assembly_info"]["biosample"]["attributes"][3]["value"]
collection_date = data['assembly_info']['biosample']['attributes'][1]['value']
env_broad_scale = data["assembly_info"]["biosample"]["attributes"][6]["value"]
env_local_scale = data['assembly_info']['biosample']['attributes'][7]['value']
env_medium = data['assembly_info']['biosample']['attributes'][8]['value']
sample_type = data['assembly_info']['biosample']['attributes'][11]['value']

# output
# 创建DataFrame
df = pd.DataFrame({
    'Accession': [accession],
    'Geo Location Name': [geo_loc_name],
    'Latitude and Longitude': [lat_lon],
    'Collection date' : [collection_date],
    'Env broad scale' : [env_broad_scale],
    'Env local scale'  : [env_local_scale],
    'Env medium' : [env_medium],
    'Sample type' : [sample_type]
})

# 将DataFrame保存为CSV文件
df.to_csv('output.csv', index=False)
print('Save result in output.csv ')

# run
$ (base) [yut@node01 ~]$ Parse_dataset_genome_json_metadata.py  dataset.json
Save result in output.csv
# output
(base) [yut@node01 ~]$ cat output.csv
Accession,Geo Location Name,Latitude and Longitude,Collection date,Env broad scale,Env local scale,Env medium,Sample type
GCA_013141435.1,Netherlands,not applicable,not applicable,drinking water treatment plant,Primary rapid sand filter,not applicable,metagenomic assembly

问题

  • Error: Internal error (invalid zip archive)并且没有输出文件
(base) [yut@io02 02_Glacier_new_taxa]$ time datasets download genome accession --inputfile GTDB_R214_63_Ferruginibacter_genus.GCA --include gff3,rna,cds,protein,genome,seq-report
New version of client (15.27.1) available at https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets
Collecting 63 genome records [================================================] 100% 63/63
Downloading: ncbi_dataset.zip    146MB done
Validating package files [===========>------------------------------------]  28% 70/252
Error: Internal error (invalid zip archive). Please try again

上述问题可能是输入编号既包括GCA又包括GCF编号,解决办法是将两者分开下载,或者等到Validating package files停掉命令

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1217352.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

FISCOBCOS入门(十)Truffle自定义测试helloworld

在windos终端内安装truffle npm install -g truffle truffle --version 出现上图情况也没问题 下面就可以进行我们的操作了 创建一个文件truffle 创建一个空工程 truffle init 在contracts内加入HelloWorld合约 // SPDX-License-Identifier: MIT pragma solidity ^0.8.0; c…

瑞吉外卖Day05

1.新增套餐 service层 Service public class SetmealServiceImpl extends ServiceImpl<SetmealMapper, Setmeal> implements SetmealService {Autowiredprivate SetmealDishService setmealDishService;/*** 新增套餐&#xff0c;同时保存套餐和菜品的关系** param set…

【你哥电力电子】从耦合电感到变压器

从耦合电感到变压器 2023年7月12日 dk 文章目录 从耦合电感到变压器1. 耦合电感1.1 一个等效1.2 自感、互感与漏感1.3 耦合系数2. 变压器3. 其他模型的推导方法3.1 T型等效电路3.2 其他等效电路4. 小结下链1. 耦合电感 1.1 一个等效 通电导线的周围会产生磁场,磁场可以通过…

IntelliJ IDE 插件开发 |(一)快速入门

前言 IntelliJ IDEA 作为 Java 开发的首选 IDE&#xff0c;其强大、方便之处不必多说。不过&#xff0c;由于个人或者团队的个性化需求&#xff0c;我们或多或少会想对其功能进行拓展&#xff0c;这时就需要开发插件&#xff08;在 IntelliJ 平台下的所有 IDE 均可运行&#x…

SpringBoot配置数据库密码加密的方法

由于系统安全的考虑,配置文件中不能出现明文密码的问题,本文就给大家详细介绍下springboot配置数据库密码加密的方法,下面话不多说了,来一起看看详细的介绍吧,需要的朋友可以参考下 1.导入依赖 <!--数据库密码加密--> <dependency><groupId>com.github.uli…

Greek Alphabet Letters Symbols

Upper CaseLower CaseGreek Letter NameEnglish EquivalentSoundΑαAlphaa ΒβBetab ΓγGammag ΔδDeltad ΕεEpsilone ΖζZetaz ΗηEtah ΘθThetath ΙιIotai ΚκKappak ΛλLambdal ΜμMum ΝνNun ΞξXix ΟοOmicrono ΠπPip ΡρRhor Σσ,…

Apache SCXML2 RCE漏洞

文章目录 前言源码利用上传恶意xml文件构造payload搭建Apache服务器 远程RCE 前言 在做 [HDCTF 2023]BabyJxVx 遇到的知识点&#xff0c;但是没公网的服务器只能作罢&#xff0c;写下这篇文章记录 源码利用 public String Flag(RequestParam(required true) String filenam…

Leetcode刷题详解——扫雷游戏

1. 题目链接&#xff1a;529. 扫雷游戏 2. 题目描述&#xff1a; 让我们一起来玩扫雷游戏&#xff01; 给你一个大小为 m x n 二维字符矩阵 board &#xff0c;表示扫雷游戏的盘面&#xff0c;其中&#xff1a; M 代表一个 未挖出的 地雷&#xff0c;E 代表一个 未挖出的 空方…

基于django电影推荐系统

基于django电影推荐系统 摘要 该Django电影推荐系统是一个简单而基础的框架&#xff0c;旨在展示系统的基本组件。系统包括两个主要模型&#xff0c;即Movie和Rating&#xff0c;用于存储电影信息和用户评分。视图层包括展示电影列表和电影详情的功能&#xff0c;使用模板进行页…

linux高级篇基础理论二(详细文档、LAMP、SHELL、sed正则表达式)

♥️作者&#xff1a;小刘在C站 ♥️个人主页&#xff1a; 小刘主页 ♥️不能因为人生的道路坎坷,就使自己的身躯变得弯曲;不能因为生活的历程漫长,就使求索的 脚步迟缓。 ♥️学习两年总结出的运维经验&#xff0c;以及思科模拟器全套网络实验教程。专栏&#xff1a;云计算技…

大数据架构Lambda-架构师(六十九)

随着信息时代技术的发展&#xff0c;数据量的快速增加逐渐飙升到了惊人的数量级别。并且数据的采集与处理技术还在更新加快。大数据中&#xff0c;结构化占比百分之15左右&#xff0c;其余百分之85都是非结构化数据&#xff0c;他们大量存在于社交网络、互联网和电子商务等领域…

深入理解锁

目录 常用锁策略 1.乐观锁 VS 悲观锁 2.轻量级锁 VS 重量级锁 3.自旋锁 VS 挂起等待锁 4.互斥锁 VS 读写锁 5.公平锁 VS 非公平锁 6.可重入锁 VS 可重入锁 CAS ABA问题 Synchronized原理 1. 锁升级/锁膨胀 2.锁消除 3.锁粗化 常用锁策略 1.乐观锁 VS 悲观锁 站在…

【Java实现图书管理系统】

图书管理系统 1. 设计背景2. 设计思路3. 模块展示代码演示3.1 Book类3.2 BookList类&#xff08;书架类&#xff09;3.4 用户类 - User类3.5 子类管理员类 -- AdminUser类3.6 子类普通用户类 -- NormalUser类3.7 操作接口3.8 操作类3.8.1 查找操作 -- FindOperation类3.8.2 增加…

PostgreSQL基于Citus实现的分布式集群

&#x1f4e2;&#x1f4e2;&#x1f4e2;&#x1f4e3;&#x1f4e3;&#x1f4e3; 哈喽&#xff01;大家好&#xff0c;我是【IT邦德】&#xff0c;江湖人称jeames007&#xff0c;10余年DBA及大数据工作经验 一位上进心十足的【大数据领域博主】&#xff01;&#x1f61c;&am…

LCD1602指定位置显示字符串-详细版

本文为博主 日月同辉&#xff0c;与我共生&#xff0c;csdn原创首发。希望看完后能对你有所帮助&#xff0c;不足之处请指正&#xff01;一起交流学习&#xff0c;共同进步&#xff01; > 发布人&#xff1a;日月同辉,与我共生_单片机-CSDN博客 > 欢迎你为独创博主日月同…

Hive数据表操作--学习笔记

1&#xff0c;Hive数据表操作 1&#xff0c;建表语句和内外部表 ①创建内部表 create [external] table [if not exists] 表名( 字段名 字段类型 [comment 注释], 字段名 字段类型 [comment 注释], ... ) [row format delimited fields terminated by 指定分隔符];&#xff0…

深度学习之基于YoloV5血红细胞检测识别系统

欢迎大家点赞、收藏、关注、评论啦 &#xff0c;由于篇幅有限&#xff0c;只展示了部分核心代码。 文章目录 一项目简介 二、功能三、系统四. 总结 一项目简介 深度学习已经在许多领域中得到了广泛的应用&#xff0c;包括医疗健康领域。其中&#xff0c;YOLO&#xff08;You O…

App加固中的代码混淆功能,让逆向工程师很头疼

App加固中的代码混淆功能&#xff0c;让逆向工程师很头疼 “我想离开浪浪山。” 在数次尝试破解某个App 时&#xff0c;某个逆向工程师无奈感慨道。 逆向工程师顾名思义就是把一个个完整的软件逆推&#xff0c;还原成一段段代码&#xff0c;方便破解。 比如给他们一个手机Ap…

行内样式、内部样式、外部样式

行内样式&#xff1a; 该元素的所在本行中使用style标记来写样式 内部样式&#xff1a; 在head标签中使用style标记来写样式 外部样式&#xff1a; 在head标签中使用link标记引用外部样式 注意优先级&#xff1a; 行内样式&#xff1e;内部样式&#xff1e;外部样式 代码…

ChatGPT简介及基本概念

点击跳转专栏>Unity3D特效百例点击跳转专栏>案例项目实战源码点击跳转专栏>游戏脚本-辅助自动化点击跳转专栏>Android控件全解手册点击跳转专栏>Scratch编程案例点击跳转>软考全系列点击跳转>蓝桥系列点击跳转>ChatGPT和AIGC &#x1f449;关于作者 专…