1 GenBank
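1.1 NCBI: the U.S. National Center for Biotechnology Information
NCBI (the National Center for Biotechnology Information) is a division of the National Library of Medicine (NLM) at the NIH. Its mission comprises four tasks: 1. creating automated systems for storing and analyzing knowledge about molecular biology, biochemistry, and genetics; 2. conducting research into advanced methods of computer-based information processing for analyzing the structure and function of biologically important molecules and compounds; 3. facilitating the use of databases and software by biotechnology researchers and medical care personnel; 4. coordinating efforts to gather biotechnology information worldwide. NCBI's databases include Nucleotide (nucleotide sequences), Genome (genomes), Structure (structures, also known as the molecular modeling database), Taxonomy (biological classification), and PopSet.
The National Center for Biotechnology Information was established in 1988 by the U.S. National Institutes of Health (NIH). Its original purpose was to give molecular biologists a system for storing and processing information. Besides maintaining the GenBank nucleic acid sequence database (whose data come from several major DNA databases worldwide, including the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory database (EMBL), and several other well-known research institutions), NCBI provides many powerful tools for data retrieval and analysis. Its resources currently include Entrez, Entrez Programming Utilities, My NCBI, PubMed, PubMed Central, Entrez Gene, NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), and Electronic PCR, 36 in all. Links to all of them can be found on the NCBI home page, www.ncbi.nlm.nih.gov, and many of them build on BLAST.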
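1.2 The GenBank DNA Database
GenBank is the DNA sequence database built by the National Center for Biotechnology Information (NCBI). It draws sequence data from public sources, chiefly direct submissions from researchers and large-scale genome sequencing projects (Benson et al., 1998). To keep the data as complete as possible, GenBank exchanges data with EMBL (the European EMBL DNA database) and DDBJ (the DNA Data Bank of Japan).
The GenBank flat file is the main bioinformatics format supported by NCBI. Once you can read GenBank, the EMBL format is easy to pick up.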
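The GenBank format is one of the oldest bioinformatics data formats. It was originally devised to bridge the gap between a human-readable representation and one that computers can process efficiently. It is optimized for human reading and is poorly suited to large-scale data processing. The format is fixed-width: the keyword sits in the first ten characters of a line and acts as an identifier, and the rest of the line holds the information that belongs to that keyword. In practice the content starts at column 13, which is why the parser in section 3 splits keyword lines at offset 12. Continuation lines leave the keyword field blank.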
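As a rough illustration of the fixed-width layout, the sketch below splits one flat-file line into its keyword and its content. `FixedWidthDemo`, `KeywordWidth`, and `SplitLine` are hypothetical names introduced here for illustration; they are not part of the parser in section 3, and the 12-character width is an assumption taken from the offset that parser uses.

```csharp
using System;

public static class FixedWidthDemo
{
    // Assumed layout: keyword field in the first 12 characters,
    // content from column 13 onward.
    public const int KeywordWidth = 12;

    // Splits one flat-file line into (keyword, content).
    // Lines shorter than the keyword field yield an empty content.
    public static (string Keyword, string Content) SplitLine(string line)
    {
        if (line == null) return ("", "");
        int cut = Math.Min(KeywordWidth, line.Length);
        return (line.Substring(0, cut).Trim(), line.Substring(cut).Trim());
    }
}
```

For example, `FixedWidthDemo.SplitLine("ACCESSION   U49845")` returns the keyword `ACCESSION` and the content `U49845`. A continuation line, whose keyword field is blank, returns an empty keyword, which is one way a caller could tell that the content belongs to the previous keyword.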
2 GenBank Overview
What is GenBank?
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.
A GenBank release occurs every two months and is available from the FTP site. The release notes for the current version of GenBank provide detailed information about the release and notifications of upcoming changes to GenBank. Release notes for previous GenBank releases are also available. GenBank growth statistics for both the traditional GenBank divisions and the WGS division are available from each release.
An annotated sample GenBank record for a Saccharomyces cerevisiae gene demonstrates many of the features of the GenBank flat file format.
Access to GenBank
There are several ways to search and retrieve data from GenBank.
Search GenBank for sequence identifiers and annotations with Entrez Nucleotide.
Search and align GenBank sequences to a query sequence using BLAST (Basic Local Alignment Search Tool). See BLAST info for more information about the numerous BLAST databases.
Search, link, and download sequences programmatically using NCBI e-utilities.
The ASN.1 and flatfile formats are available at NCBI's anonymous FTP server: ftp://ftp.ncbi.nlm.nih.gov/ncbi-asn1 and ftp://ftp.ncbi.nlm.nih.gov/genbank.
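To make the programmatic route more concrete, here is a minimal sketch that builds an E-utilities `efetch` request for one nucleotide record in GenBank flat file format and downloads it as text. `EFetchDemo`, `BuildGenBankUrl`, and `FetchGenBankAsync` are hypothetical names chosen for this example, and the accession in the usage note is only an illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class EFetchDemo
{
    // Base URL of the E-utilities efetch endpoint.
    private const string EFetchBase =
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

    // Builds a request for one nucleotide record as a GenBank flat file
    // (db=nuccore, rettype=gb, retmode=text).
    public static string BuildGenBankUrl(string accession)
    {
        if (string.IsNullOrWhiteSpace(accession))
            throw new ArgumentException("An accession is required.", nameof(accession));
        return EFetchBase
            + "?db=nuccore&id=" + Uri.EscapeDataString(accession.Trim())
            + "&rettype=gb&retmode=text";
    }

    // Downloads the record as text. Requires network access.
    public static async Task<string> FetchGenBankAsync(string accession)
    {
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync(BuildGenBankUrl(accession));
        }
    }
}
```

Calling `await EFetchDemo.FetchGenBankAsync("U49845")` should return the flat file text of that record, which could then be handed to a parser such as the one in section 3. Anyone sending many requests should check NCBI's usage guidelines for E-utilities first.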
GenBank Data Usage
The GenBank database is designed to provide and encourage access within the scientific community to the most up-to-date and comprehensive DNA sequence information. Therefore, NCBI places no restrictions on the use or distribution of the GenBank data. However, some submitters may claim patent, copyright, or other intellectual property rights in all or a portion of the data they have submitted. NCBI is not in a position to assess the validity of such claims, and therefore cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in GenBank.
Data Processing, Status and Release
The most important source of new data for GenBank is direct submissions from a variety of individuals, including researchers, using one of our submission tools. Following submission, data are subject to automated and manual processing to ensure data integrity and quality and are subsequently made available to the public. On rare occasions, data may be removed from public view. More details about this process can be found on the NLM GenBank and SRA Data Processing page.
Confidentiality
Some authors are concerned that the appearance of their data in GenBank prior to publication will compromise their work. GenBank will, upon request, withhold release of new submissions for a specified period of time. However, if the accession number or sequence data appears in print or online prior to the specified date, your sequence will be released. In order to prevent the delay in the appearance of published sequence data, we urge authors to inform us of the appearance of the published data. As soon as it is available, please send the full publication data--all authors, title, journal, volume, pages and date--to the following address: update@ncbi.nlm.nih.gov
Privacy
If you are submitting human sequences to GenBank, do not include any data that could reveal the personal identity of the source. GenBank assumes that the submitter has received any necessary informed consent authorizations required prior to submitting sequences.
3 GenBank Parser: C# Source Code
using System;
using System.IO;
using System.Text;
using System.Linq;
using System.Drawing;
using System.Collections;
using System.Collections.Generic;
using System.Runtime.Serialization;
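// B (S2L, BreakTo), Utility and SimpleJson are external helpers that are not included in this listing.
namespace Legal.BIOG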
{
[DataContract]
public class GENBANK_ELEMENT
{
[DataMember(Order = 1)]
public string Name { get; set; } = "";
[DataMember(Order = 2)]
public string Content { get; set; } = "";
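public GENBANK_ELEMENT(int position, string buf)
{
// Clamp the split offset so that lines shorter than the keyword field do not throw.
int cut = Math.Min(position, buf.Length);
Name = buf.Substring(0, cut).Trim();
Content = buf.Substring(cut).Trim();
}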
public GENBANK_ELEMENT(string name, string content)
{
Name = name;
Content = content;
}
}
[DataContract]
public class GENBANK_REFERENCE
{
[DataMember(Order = 1)]
public string Name { get; set; } = "";
[DataMember(Order = 2)]
public List<GENBANK_ELEMENT> Items { get; set; } = new List<GENBANK_ELEMENT>();
public void Append(int position, string buf)
{
Items.Add(new GENBANK_ELEMENT(position, buf));
}
public void Append(string name, string content)
{
Items.Add(new GENBANK_ELEMENT(name, content));
}
public GENBANK_ELEMENT Find(string name)
{
return Items.Find(t => t.Name == name);
}
}
[DataContract]
public class GENBANK_FEATURE
{
[DataMember(Order = 1)]
public string Name { get; set; } = "";
[DataMember(Order = 2)]
public string Lines { get; set; } = "";
public GENBANK_FEATURE(string name, string lines)
{
Name = name;
Lines = lines;
}
public List<string> FeatureList
{
get
{
string[] ra = B.S2L(Lines);
return ra.ToList();
}
}
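/// <summary>
/// Looks up a qualifier of this feature,
/// for example /db_xref=
/// </summary>
/// <param name="name">Feature key, e.g. CDS</param>
/// <param name="branch">Qualifier name without the leading slash, e.g. db_xref</param>
/// <returns>The qualifier value, or an empty string if the feature key or qualifier does not match</returns>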
public string FindBranch(string name, string branch)
{
List<string> list = FeatureList;
if (Name == name)
{
foreach (string s in list)
{
if (s.StartsWith("/" + branch + "="))
{
return s.Substring(branch.Length + 2);
}
}
}
return "";
}
public string Position
{
get
{
List<string> list = FeatureList;
// Guard against a feature that has no lines at all.
return (list.Count > 0 && list[0].Contains("..")) ? list[0] : "";
}
}
public List<Point> PositionList
{
get
{
return Utility.PositionList(Position);
}
}
}
[DataContract]
public class GENBANK_Item
{
[DataMember(Order = 1)]
public List<GENBANK_ELEMENT> Descriptions { get; set; } = new List<GENBANK_ELEMENT>();
[DataMember(Order = 2)]
public List<GENBANK_REFERENCE> References { get; set; } = new List<GENBANK_REFERENCE>();
[DataMember(Order = 3)]
public List<GENBANK_REFERENCE> Source { get; set; } = new List<GENBANK_REFERENCE>();
[DataMember(Order = 4)]
public List<GENBANK_FEATURE> Features { get; set; } = new List<GENBANK_FEATURE>();
public string Find(string name)
{
GENBANK_ELEMENT de = Descriptions.Find(t => t.Name == name);
// List<T>.Find returns null when nothing matches.
return (de != null) ? de.Content : "";
}
public string Sequence
{
get
{
GENBANK_ELEMENT sq = Descriptions.Find(t => t.Name == "ORIGIN");
// List<T>.Find returns null when the record has no ORIGIN block.
return (sq != null) ? sq.Content : "";
}
}
}
public class GENBANK_File
{
public List<GENBANK_Item> Items { get; set; } = new List<GENBANK_Item>();
public GENBANK_File(string buf)
{
try
{
string[] xlines = B.S2L(buf);
GENBANK_Item item = null;
for (int i = 0; i < xlines.Length; i++)
{
if (xlines[i].StartsWith("LOCUS"))
{
if (item != null) { Items.Add(item); item = null; }
item = new GENBANK_Item();
item.Descriptions.Add(new GENBANK_ELEMENT(12, xlines[i]));
continue;
}
if (xlines[i].StartsWith("DEFINITION") ||
xlines[i].StartsWith("ACCESSION") ||
xlines[i].StartsWith("VERSION") ||
xlines[i].StartsWith("KEYWORDS") ||
xlines[i].StartsWith("COMMENT"))
{
string rs = Utility.ReadMultiLines(ref i, xlines, out string kw, 12, " ");
item.Descriptions.Add(new GENBANK_ELEMENT(kw, rs));
continue;
}
else if (xlines[i].StartsWith("SOURCE"))
{
GENBANK_REFERENCE src = new GENBANK_REFERENCE();
src.Name = xlines[i].Substring(12).Trim(); i++;
while (true)
{
string rs = Utility.ReadMultiLines(ref i, xlines, out string kw, 12);
src.Append(kw, rs);
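// Stop at the end of the input or when the next line is not indented (empty lines included).
if (i + 1 >= xlines.Length || !xlines[i + 1].StartsWith(" ")) { break; }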
i++;
}
item.Source.Add(src);
continue;
}
else if (xlines[i].StartsWith("REFERENCE"))
{
GENBANK_REFERENCE rfx = new GENBANK_REFERENCE();
rfx.Name = xlines[i].Substring(12).Trim(); i++;
while (true)
{
string rs = Utility.ReadMultiLines(ref i, xlines, out string kw, 12);
rfx.Append(kw, rs);
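// Stop at the end of the input or when the next line is not indented (empty lines included).
if (i + 1 >= xlines.Length || !xlines[i + 1].StartsWith(" ")) { break; }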
i++;
}
item.References.Add(rfx);
continue;
}
else if (xlines[i].StartsWith("FEATURES"))
{
item.Descriptions.Add(new GENBANK_ELEMENT("FEATURES", xlines[i].Substring(12).Trim())); i++;
while (true)
{
string rs = Utility.ReadFeatureLines(ref i, xlines, out string kw, 0, 21);
GENBANK_FEATURE ef = new GENBANK_FEATURE(kw, rs);
item.Features.Add(ef);
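// Stop at the end of the input or when the next line is not indented (empty lines included).
if (i + 1 >= xlines.Length || !xlines[i + 1].StartsWith(" ")) { break; }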
i++;
}
continue;
}
else if (xlines[i].StartsWith("//"))
{
if (item != null) { Items.Add(item); item = null; }
continue;
}
else if (xlines[i].StartsWith("ORIGIN"))
{
i++;
string rs = Utility.ReadSequenceLines(ref i, xlines, 10);
item.Descriptions.Add(new GENBANK_ELEMENT("ORIGIN", rs));
continue;
}
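else
{
// Unrecognized line. Ignore it when no record is open (for example a blank line after "//").
if (item != null) { item.Descriptions.Add(new GENBANK_ELEMENT("UNKNOW", xlines[i])); }
continue;
}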
}
if (item != null)
{
Items.Add(item);
}
}
catch (Exception ex)
{
// Keep the original exception as InnerException so the stack trace is not lost.
throw new Exception("GENBANK_File() ERROR: " + ex.Message, ex);
}
}
public static GENBANK_File FromFile(string filename)
{
try
{
string buf = File.ReadAllText(filename);
return new GENBANK_File(buf);
}
catch (Exception ex)
{
// Keep the original exception as InnerException so the stack trace is not lost.
throw new Exception("GENBANK_File.FromFile ERROR: " + ex.Message, ex);
}
}
public void Write_Json(string filename)
{
try
{
File.WriteAllText(filename, SimpleJson.SerializeObject(Items));
}
catch (Exception ex)
{
// Keep the original exception as InnerException so the stack trace is not lost.
throw new Exception("GENBANK_File.Write_Json ERROR: " + ex.Message, ex);
}
}
public string Fasta_Sequences()
{
StringBuilder sb = new StringBuilder();
foreach (GENBANK_Item item in Items)
{
sb.AppendLine(">" + item.Find("DEFINITION"));
sb.AppendLine(B.BreakTo(item.Sequence));
sb.AppendLine("");
}
return sb.ToString();
}
public string Print_Features()
{
StringBuilder sb = new StringBuilder();
foreach (GENBANK_Item item in Items)
{
foreach (GENBANK_FEATURE feature in item.Features)
{
if (feature.FeatureList.Count > 1)
{
sb.AppendLine(">" + feature.Name + " " + feature.FeatureList[1]);
sb.AppendLine(B.BreakTo(Utility.SequenceByPosition(item.Sequence, feature.PositionList)));
sb.AppendLine("");
}
}
}
return sb.ToString();
}
public string Protein()
{
StringBuilder sb = new StringBuilder();
foreach (GENBANK_Item item in Items)
{
foreach (GENBANK_FEATURE feature in item.Features)
{
string tr = feature.FindBranch("CDS", "translation");
if (tr.Length > 0)
{
sb.AppendLine(">" + feature.Name + " " + feature.FeatureList[1]);
sb.AppendLine(B.BreakTo(tr.Replace(" ", "").Replace("\"", "")));
sb.AppendLine("");
}
}
}
return sb.ToString();
}
}
}
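The listing calls helpers that are not shown (`B.S2L`, `Utility.ReadSequenceLines`, `Utility.PositionList`, `Utility.SequenceByPosition`). As a rough idea of what such helpers might do, here is a self-contained sketch. Everything in it is an assumption: `GenBankHelperSketch` and its method names are hypothetical, value tuples stand in for `Point`, and the behavior is a guess at the intent rather than the author's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class GenBankHelperSketch
{
    // Splits a buffer into lines, accepting both Windows and Unix line endings.
    public static string[] SplitLines(string buf)
    {
        return (buf ?? "").Replace("\r\n", "\n").Split('\n');
    }

    // Drops the position counters and blanks of ORIGIN lines, keeping only the letters.
    public static string CleanSequence(IEnumerable<string> originLines)
    {
        var sb = new StringBuilder();
        foreach (string line in originLines)
        {
            foreach (char c in line)
            {
                if (char.IsLetter(c)) sb.Append(c);
            }
        }
        return sb.ToString();
    }

    // Extracts every "start..end" pair from a location such as
    // "687..3158", "complement(3300..4037)" or "join(1..10,20..30)".
    // Partial markers (< and >) are skipped; strand and single-base
    // positions are ignored in this sketch.
    public static List<(int Start, int End)> ParseRanges(string location)
    {
        var ranges = new List<(int Start, int End)>();
        foreach (Match m in Regex.Matches(location ?? "", @"<?(\d+)\.\.>?(\d+)"))
        {
            ranges.Add((int.Parse(m.Groups[1].Value), int.Parse(m.Groups[2].Value)));
        }
        return ranges;
    }

    // Concatenates the 1-based, inclusive ranges of a sequence.
    // Ranges that fall outside the sequence are skipped.
    public static string Slice(string sequence, List<(int Start, int End)> ranges)
    {
        var sb = new StringBuilder();
        foreach (var (start, end) in ranges)
        {
            if (start < 1 || end > sequence.Length || start > end) continue;
            sb.Append(sequence, start - 1, end - start + 1);
        }
        return sb.ToString();
    }
}
```

With these stand-ins, `ParseRanges("join(1..10,20..30)")` yields the pairs (1, 10) and (20, 30), and `Slice` joins the matching stretches of the sequence. Note that a feature on the complementary strand would still need reverse-complementing, which this sketch leaves out.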