Building a TxDb object
Here is an example of how to build a TxDb object from a Gene Transfer Format (GTF) file.
A GTF file has the same structure as GFF, so each record has the fields: <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
The GTF file we downloaded begins as follows:
/home/USER/db/refanno/gencode.v33.annotation.gtf
##description: evidence-based annotation of the human genome (GRCh38), version 33 (Ensembl 99)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2019-12-13
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
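To make the field layout concrete, here is a minimal base R sketch that splits one record into its nine fields and turns the attribute column into a named vector. It assumes the fields are tab-separated, as the GTF format specifies (the excerpt above shows them with spaces). The record is the `gene` line from the excerpt with its attribute list shortened, and `parse_gtf_attributes()` is a hypothetical helper written for this illustration, not part of any package.

```r
# One record from the excerpt above, with the attribute list shortened.
# Assumption: fields are tab-separated, attributes are `key value;` pairs.
gtf_line <- paste(
  "chr1", "HAVANA", "gene", "11869", "14409", ".", "+", ".",
  'gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2;',
  sep = "\t"
)

field_names <- c("seqname", "source", "feature", "start", "end",
                 "score", "strand", "frame", "attributes")

# Split the record into its nine fields and label them
fields <- strsplit(gtf_line, "\t", fixed = TRUE)[[1]]
names(fields) <- field_names

# Hypothetical helper: naive attribute parser.
# It splits on ";" and would break on values that contain a semicolon.
parse_gtf_attributes <- function(attr_string) {
  pairs <- trimws(strsplit(attr_string, ";", fixed = TRUE)[[1]])
  pairs <- pairs[nzchar(pairs)]
  keys <- sub(" .*$", "", pairs)                 # text before the first space
  values <- sub("^[^ ]+ ", "", pairs)            # text after the first space
  values <- gsub('"', "", values, fixed = TRUE)  # drop surrounding quotes
  setNames(values, keys)
}

attrs <- parse_gtf_attributes(fields[["attributes"]])
print(fields[c("seqname", "feature", "start", "end", "strand")])
print(attrs)
```

In practice there is no need to parse the file by hand; this only illustrates what the columns hold before the file is handed to makeTxDbFromGFF().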
In /home/USER/db/refanno we execute R:
library(GenomicFeatures)
gtf <- "gencode.v33.annotation.gtf"
txdb.filename <- "gencode.v33.annotation.sqlite"
txdb <- makeTxDbFromGFF(gtf)
# We can use saveDb() to save the TxDb (an SQLite database) for later use
saveDb(txdb, txdb.filename)
# We can use loadDb() to reload the saved TxDb in a later session
txdb <- loadDb(txdb.filename)
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: gencode.v33.annotation.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 227912
# exon_nrow: 747278
# cds_nrow: 275239
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2020-04-27 15:47:27 +0100 (Mon, 27 Apr 2020)
# GenomicFeatures version at creation time: 1.38.2
# RSQLite version at creation time: 2.2.0
# DBSCHEMAVERSION: 1.2
> genes(txdb)
GRanges object with 60662 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000000003.15 chrX 100627108-100639991 - | ENSG00000000003.15
ENSG00000000005.6 chrX 100584936-100599885 + | ENSG00000000005.6
ENSG00000000419.12 chr20 50934867-50958555 - | ENSG00000000419.12
ENSG00000000457.14 chr1 169849631-169894267 - | ENSG00000000457.14
ENSG00000000460.17 chr1 169662007-169854080 + | ENSG00000000460.17
... ... ... ... . ...
ENSG00000288584.1 chr6 164148022-164152175 + | ENSG00000288584.1
ENSG00000288585.1 chr3 141449745-141456434 - | ENSG00000288585.1
ENSG00000288586.1 chr9 35603437-35605139 - | ENSG00000288586.1
ENSG00000288587.1 chr6 31400702-31463705 + | ENSG00000288587.1
ENSG00000288588.1 chr4 6245563-6261639 + | ENSG00000288588.1
-------
seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths
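Besides genes(), the saved TxDb can be queried for transcripts and exons in the same way. The following is a sketch under stated assumptions: GenomicFeatures is installed and the SQLite file written by saveDb() above is in the working directory. The guard around the code is added only so that the snippet does nothing when either is missing.

```r
# Sketch only: assumes the SQLite file created by saveDb() above exists
# in the working directory and that GenomicFeatures is installed.
txdb.filename <- "gencode.v33.annotation.sqlite"

if (requireNamespace("GenomicFeatures", quietly = TRUE) &&
    file.exists(txdb.filename)) {
  library(GenomicFeatures)
  txdb <- loadDb(txdb.filename)

  # All transcripts as a GRanges object
  tx <- transcripts(txdb)

  # Exons grouped by transcript, with transcript IDs as list names
  exons_by_tx <- exonsBy(txdb, by = "tx", use.names = TRUE)

  # Transcripts grouped by gene
  tx_by_gene <- transcriptsBy(txdb, by = "gene")

  # Transcripts of the first gene in the GTF excerpt above
  ddx11l1_tx <- select(txdb, keys = "ENSG00000223972.5",
                       columns = c("TXNAME", "TXSTART", "TXEND"),
                       keytype = "GENEID")
  print(ddx11l1_tx)
}
```

The grouped accessors return GRangesList objects, which are usually more convenient than the flat GRanges from transcripts() or exons() when working per gene or per transcript.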