Building a TxDb object

This example shows how to build a TxDb object from a Gene Transfer Format (GTF) annotation file.

GTF uses the same tab-separated layout as GFF, so each record has the fields: <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

Here are the first lines of the GTF file we downloaded.

/home/USER/db/refanno/gencode.v33.annotation.gtf
##description: evidence-based annotation of the human genome (GRCh38), version 33 (Ensembl 99)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2019-12-13
chr1	HAVANA	gene	11869	14409	.	+	.	gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	12613	12721	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	13221	14409	.	+	.	gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
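
As a minimal sketch of the field layout described above, the first gene record can be split into its nine tab-separated columns with base R. The names `gtf_cols`, `gtf_line`, `fields` and `attrs`, and the helper `parse_attributes`, are illustrative and not part of any package. The attribute list is shortened from the record shown above, and the helper assumes that attribute values contain no semicolons.

```r
# Column names of a GTF record (nine tab-separated fields).
gtf_cols <- c("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attributes")

# First gene record from the file, with the attribute list shortened.
gtf_line <- paste(
  "chr1", "HAVANA", "gene", "11869", "14409", ".", "+", ".",
  'gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2;',
  sep = "\t"
)

# Split on tabs and label the fields.
fields <- setNames(strsplit(gtf_line, "\t", fixed = TRUE)[[1]], gtf_cols)

# Hypothetical helper: turn 'key "value"; key value;' into a named
# character vector. Assumes values contain no semicolons.
parse_attributes <- function(x) {
  parts <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  parts <- parts[nzchar(parts)]
  keys <- sub(" .*$", "", parts)
  values <- gsub('"', "", sub("^[^ ]+ ", "", parts), fixed = TRUE)
  setNames(values, keys)
}

attrs <- parse_attributes(fields[["attributes"]])
```

With these objects, `fields[["feature"]]` gives the feature type and `attrs[["gene_name"]]` gives the gene symbol. For whole files, makeTxDbFromGFF() below does this parsing for us.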

From /home/USER/db/refanno, we start R and run:

library(GenomicFeatures)

gtf <- "gencode.v33.annotation.gtf"
txdb.filename <- "gencode.v33.annotation.sqlite"

txdb <- makeTxDbFromGFF(gtf)

# We can use saveDb() to save the TxDb database (an SQLite database) for later use
saveDb(txdb, txdb.filename)

# We can use loadDb() to reload the saved TxDb database
txdb <- loadDb(txdb.filename)
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: gencode.v33.annotation.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 227912
# exon_nrow: 747278
# cds_nrow: 275239
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2020-04-27 15:47:27 +0100 (Mon, 27 Apr 2020)
# GenomicFeatures version at creation time: 1.38.2
# RSQLite version at creation time: 2.2.0
# DBSCHEMAVERSION: 1.2
> genes(txdb)
GRanges object with 60662 ranges and 1 metadata column:
                     seqnames              ranges strand |            gene_id
                        <Rle>           <IRanges>  <Rle> |        <character>
  ENSG00000000003.15     chrX 100627108-100639991      - | ENSG00000000003.15
   ENSG00000000005.6     chrX 100584936-100599885      + |  ENSG00000000005.6
  ENSG00000000419.12    chr20   50934867-50958555      - | ENSG00000000419.12
  ENSG00000000457.14     chr1 169849631-169894267      - | ENSG00000000457.14
  ENSG00000000460.17     chr1 169662007-169854080      + | ENSG00000000460.17
                 ...      ...                 ...    ... .                ...
   ENSG00000288584.1     chr6 164148022-164152175      + |  ENSG00000288584.1
   ENSG00000288585.1     chr3 141449745-141456434      - |  ENSG00000288585.1
   ENSG00000288586.1     chr9   35603437-35605139      - |  ENSG00000288586.1
   ENSG00000288587.1     chr6   31400702-31463705      + |  ENSG00000288587.1
   ENSG00000288588.1     chr4     6245563-6261639      + |  ENSG00000288588.1
  -------
  seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths
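
The summary above reports `Organism: NA` because nothing was passed to fill it in. As a hedged sketch, makeTxDbFromGFF() accepts `format`, `dataSource` and `organism` arguments, and the resulting TxDb can be grouped per gene or per transcript with transcriptsBy() and exonsBy(). The wrapper names `build_txdb` and `summarise_txdb` are hypothetical. Recent Bioconductor releases provide makeTxDbFromGFF() in the txdbmaker package, so the sketch looks there first. The code only runs when the GTF file and the packages are available.

```r
# Sketch only: requires Bioconductor's GenomicFeatures (and, on recent
# releases, txdbmaker) plus the GTF file used in this example.
build_txdb <- function(gtf, organism = "Homo sapiens") {
  # makeTxDbFromGFF() moved to txdbmaker in recent Bioconductor releases;
  # fall back to GenomicFeatures on older installations.
  pkg <- if (requireNamespace("txdbmaker", quietly = TRUE)) {
    "txdbmaker"
  } else {
    "GenomicFeatures"
  }
  make_txdb <- getExportedValue(pkg, "makeTxDbFromGFF")
  make_txdb(gtf, format = "gtf", dataSource = basename(gtf),
            organism = organism)
}

# Hypothetical helper: count genes and transcripts and summarise the
# number of transcripts per gene.
summarise_txdb <- function(txdb) {
  tx_by_gene <- GenomicFeatures::transcriptsBy(txdb, by = "gene")
  ex_by_tx <- GenomicFeatures::exonsBy(txdb, by = "tx", use.names = TRUE)
  list(
    n_genes = length(tx_by_gene),
    n_transcripts = length(ex_by_tx),
    tx_per_gene = summary(S4Vectors::elementNROWS(tx_by_gene))
  )
}

gtf <- "gencode.v33.annotation.gtf"
if (file.exists(gtf) && requireNamespace("GenomicFeatures", quietly = TRUE)) {
  txdb <- build_txdb(gtf)
  print(summarise_txdb(txdb))
}
```

Grouping by gene or transcript returns a GRangesList, which is often more convenient than the flat genes(txdb) ranges when counting or annotating features.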
