Building a TxDb object
This example shows how to build a TxDb object from a Gene Transfer Format (GTF) file.
GTF has the same structure as GFF: each record is one tab-separated line with the fields <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]
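To make the field layout concrete, here is a minimal base R sketch that splits one GTF record into its nine fields and then breaks the attributes column into key/value pairs. The record is shortened from the GENCODE excerpt on this page, and `parse_attributes()` is a hypothetical helper written for this illustration, not a function from any package. It assumes that attribute values contain no semicolons.

```r
# One tab-separated GTF record (shortened from the GENCODE v33 file).
gtf_line <- paste(
  "chr1", "HAVANA", "exon", "11869", "12227", ".", "+", ".",
  'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; exon_number 1;',
  sep = "\t"
)

# Split the record into its nine fields.
field_names <- c("seqname", "source", "feature", "start", "end",
                 "score", "strand", "frame", "attributes")
fields <- setNames(strsplit(gtf_line, "\t", fixed = TRUE)[[1]], field_names)

# Hypothetical helper: turn the attributes column into a named character vector.
# Assumes no semicolons inside quoted values.
parse_attributes <- function(x) {
  pairs <- trimws(strsplit(x, ";", fixed = TRUE)[[1]])
  pairs <- pairs[nzchar(pairs)]
  keys <- sub(" .*$", "", pairs)                      # text before the first space
  values <- sub("^[^ ]+ +", "", pairs)                # text after the key
  values <- gsub('"', "", values, fixed = TRUE)       # drop surrounding quotes
  setNames(values, keys)
}

attrs <- parse_attributes(fields[["attributes"]])
print(fields[c("seqname", "feature", "start", "end", "strand")])
print(attrs)
```

In practice makeTxDbFromGFF() does this parsing for you; the sketch is only meant to show what each column holds.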
We can inspect the header and the first records of the GTF file we downloaded:
##description: evidence-based annotation of the human genome (GRCh38), version 33 (Ensembl 99)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2019-12-13
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 12613 12721 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 13221 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_name "DDX11L1-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
In /home/USER/db/refanno, we start R and run:
library(GenomicFeatures)
gtf <- "gencode.v33.annotation.gtf"
txdb.filename <- "gencode.v33.annotation.sqlite"
txdb <- makeTxDbFromGFF(gtf)
# saveDb() writes the TxDb to an SQLite file so it can be reused later
saveDb(txdb, txdb.filename)
# loadDb() reloads the saved TxDb in a later session without re-parsing the GTF
txdb <- loadDb(txdb.filename)
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: gencode.v33.annotation.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 227912
# exon_nrow: 747278
# cds_nrow: 275239
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2020-04-27 15:47:27 +0100 (Mon, 27 Apr 2020)
# GenomicFeatures version at creation time: 1.38.2
# RSQLite version at creation time: 2.2.0
# DBSCHEMAVERSION: 1.2
> genes(txdb)
GRanges object with 60662 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
ENSG00000000003.15 chrX 100627108-100639991 - | ENSG00000000003.15
ENSG00000000005.6 chrX 100584936-100599885 + | ENSG00000000005.6
ENSG00000000419.12 chr20 50934867-50958555 - | ENSG00000000419.12
ENSG00000000457.14 chr1 169849631-169894267 - | ENSG00000000457.14
ENSG00000000460.17 chr1 169662007-169854080 + | ENSG00000000460.17
... ... ... ... . ...
ENSG00000288584.1 chr6 164148022-164152175 + | ENSG00000288584.1
ENSG00000288585.1 chr3 141449745-141456434 - | ENSG00000288585.1
ENSG00000288586.1 chr9 35603437-35605139 - | ENSG00000288586.1
ENSG00000288587.1 chr6 31400702-31463705 + | ENSG00000288587.1
ENSG00000288588.1 chr4 6245563-6261639 + | ENSG00000288588.1
-------
seqinfo: 25 sequences (1 circular) from an unspecified genome; no seqlengths
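Beyond genes(), the same TxDb can be queried at the transcript and exon level. The following is a minimal sketch, assuming the SQLite file saved above is in the working directory and that the installed GenomicFeatures version still provides these accessors as in 1.38.2. The IDs are taken from the GTF excerpt on this page. Output is not shown because it depends on the annotation release.

```r
library(GenomicFeatures)

txdb.filename <- "gencode.v33.annotation.sqlite"

# Run only if the database saved earlier is present in the working directory.
if (file.exists(txdb.filename)) {
  txdb <- loadDb(txdb.filename)

  # All transcripts as a GRanges object (one range per transcript).
  tx <- transcripts(txdb)
  print(length(tx))

  # Exons grouped by transcript; use.names = TRUE labels each group
  # with its transcript ID instead of an internal integer ID.
  exons_by_tx <- exonsBy(txdb, by = "tx", use.names = TRUE)

  # Exons of the DDX11L1 transcript shown in the GTF excerpt.
  print(exons_by_tx[["ENST00000456328.2"]])

  # Transcripts grouped by gene; group names are gene IDs.
  tx_by_gene <- transcriptsBy(txdb, by = "gene")
  print(tx_by_gene[["ENSG00000223972.5"]])
}
```

Grouping with exonsBy() or transcriptsBy() returns a GRangesList, which is usually the more convenient shape when features need to be looked up per transcript or per gene.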