Annotation File Preparation – Defining Genomic Regions

In this section, we will use the reference annotation files retrieved from public databases to generate several genomic positional templates in BED format. We will use these files in the calculation of DNA methylation at specific regions, such as promoter, intergenic, CpG islands, etc.

500 Chromosomal Bins

We will use the makewindows program of the bedtools utilities to subset the genome into 500 bins for each chromosome.

cd ~/

bsub -q 16G -o stdout -e stderr "bedtools makewindows -g /work3/NRPB1219/chromInfo.txt -n 500 -s 1 | gzip > ~/Data/hg18.500bins.bed.gz"

CpG Islands and Transcriptional Factor Binding Sites (TFBS)

The CpG island and TFBS annotation files downloaded from the UCSC genome browser is not in BED format. Hence we use awk to process them. We then pipe the output in the sortBed program of the bedtools utilities to make sure the BED files were properly coordinate-sorted.

cd ~/
awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2,$3,"CpG|"$6,$8*10,"+" }' /work3/NRPB1219/cpgIslandExt.txt | sortBed | gzip > Data/cpgIslandExt.bed.gz
awk -F $'\t' 'BEGIN { OFS=FS } { if($6 >= 500) print $2,$3,$4,$5,$6,"+" }' /work3/NRPB1219/wgEncodeRegTfbsClustered.txt | sortBed | gzip > Data/wgEncodeRegTfbsClustered.bed.gz

The above command read the content of cpgIslandExt.txt :

chr1    18598   19673   CpG: 116        1075    116     787     21.6    73.2    0.83
chr1    124987  125426  CpG: 30 439     30      295     13.7    67.2    0.64
chr1    317653  318092  CpG: 29 439     29      295     13.2    67.2    0.62
chr1    427014  428027  CpG: 84 1013    84      734     16.6    72.5    0.64
chr1    439136  440407  CpG: 99 1271    99      777     15.6    61.1    0.84

And produce cpgIslandExt.bed in BED format:

chr1    18598   19673   CpG|116 216     +
chr1    124987  125426  CpG|30  137     +
chr1    317653  318092  CpG|29  132     +
chr1    427014  428027  CpG|84  166     +
chr1    439136  440407  CpG|99  156     +

And also read the content of wgEncodeRegTfbsClustered.txt :

585     chr1    51      504     ZBTB33  392     .       51      504     0       1       453     0       25
585     chr1    130     306     c-Jun   459     .       130     306     0       1       176     0       25
585     chr1    247     292     NFKB    1000    .       247     292     0       1       45      0       25
585     chr1    7125    8886    GR      99      .       7125    8886    0       1       1761    0       25
585     chr1    9914    10042   HEY1    44      .       9914    10042   0       1       128     0       25

Perform a filtering on the score (6th column) and produce wgEncodeRegTfbsClustered.bed in BED format:

chr1    247     292     NFKB    1000    +
chr1    81184   81397   Rad21   1000    +
chr1    227464  227809  Rad21   602     +
chr1    227505  227800  CTCF    1000    +
chr1    530491  530823  c-Myc   1000    +

Promoters, Exonic, Intronic, UTR and intergenic Regions

The gene annotation file we used here was curated by the GENCODE project (http://www.gencodegenes.org/). The GTF format is a standardized way to present gene annotation information of a genome (including gene, transcript, UTR, exon, start and stop codon).

head -n 15 /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf

##description: evidence-based annotation of the human genome (NCBI36), version 3c
##provider: GENCODE
##contact: fsk@sanger.ac.uk
##format: gtf
##date: 2009-10-01
chr1    HAVANA  gene    1737    4275    .       +       .       gene_id "ENSG00000223972"; transcript_id "ENSG00000223972"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1"; level 2; havana_gene "OTTHUMG00000000961";
chr1    ENSEMBL transcript      1737    4275    .       +       .       gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1    ENSEMBL exon    1737    2090    .       +       .       gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1    ENSEMBL exon    2476    2584    .       +       .       gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1    ENSEMBL exon    3084    4275    .       +       .       gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1    ENSEMBL CDS     4022    4249    .       +       0       gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1    ENSEMBL start_codon     4022    4024    .       +       0       gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1    ENSEMBL stop_codon      4250    4252    .       +       0       gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1    ENSEMBL UTR     1737    2090    .       +       .       gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1    ENSEMBL UTR     2476    2584    .       +       .       gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";

Please refer to online documentation for more information about the GTF format:

By executing the commands below, we will process the GTF file and generate genomic regions such as promoter regions, exonic, intronic, intergenic regions in BED format. The sortBed and mergeBed programs of the bedtools utilities was used to ensure the output is coordinate-sorted and overlapping regions were merged.

cd ~/

bsub -q 16G -o stdout -e stderr "awk -F $'\t' 'BEGIN { OFS=FS } { if(\$3 == \"exon\") print \$1,\$4-1,\$5 }' /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf | sortBed | mergeBed -i - | gzip > Data/gencode.v3c.exon_merged.bed.gz"

!!! Use bjobs to check the job has completed, and ls to make sure this file was created before proceed.

ls -la Data/gencode.v3c.exon_merged.bed.gz

# Console output

-rw------- 1 s00yao00 s00yao00 2097963 2014-12-19 23:24 Data/gencode.v3c.exon_merged.bed.gz
cd ~/

bsub -q 16G -o stdout -e stderr "awk -F $'\t' 'BEGIN { OFS=FS } { if(\$3 == \"gene\") print \$1,\$4-1,\$5 }' /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf | sortBed | subtractBed -a - -b Data/gencode.v3c.exon_merged.bed.gz | sortBed | mergeBed -i - | gzip > Data/gencode.v3c.intron_merged.bed.gz"

bsub -q16G -o stdout -e stderr "awk -F $'\t' 'BEGIN { OFS=FS } { if(\$3 == \"gene\") print \$1,\$4-1,\$5 }' /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf | sortBed | complementBed -i - -g /work3/NRPB1219/chromInfo.txt | gzip > Data/gencode.v3c.intergenic.bed.gz"

bsub -q 16G -o stdout -e stderr "awk -F $'\t' 'BEGIN { OFS=FS } { if(\$3 == \"transcript\") print \$1,\$4-1,\$5,\$7,\$9 }' /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf | sed 's/^\\([[:alnum:]]*\\)\\t\\([[:digit:]]*\\)\\t\\([[:digit:]]*\\)\\t\\([+|-]\\).*transcript_id \"\\([[:alnum:]]*\\)\".*/\\1\\t\\2\\t\\3\\t\\5\\t0\\t\\4/' | sortBed | gzip > Data/gencode.v3c.transcript.bed.gz"

!!! Use bjobs to check the all jobs have completed, and ls to make sure this file was created before proceed.

ls -la Data/gencode.v3c.transcript.bed.gz

# Console output

-rw------- 1 s00yao00 s00yao00 1478470 2014-12-19 23:33 Data/gencode.v3c.transcript.bed.gz

The promoter region defined here is the transcription start site (TSS) upstream 1000 bp & TSS downstream 500 bp.

cd ~/

bsub -q 16G -o stdout -e stderr "zcat Data/gencode.v3c.transcript.bed.gz | awk -F $'\t' 'BEGIN { OFS=FS }  { if(\$6 == \"+\") { left = \$2-1000; right = \$2+500 } else { left = \$3-500; right = \$3+1000 } if(left < 0) left = 0; print \$1,left,right }' | sortBed | mergeBed -i - | gzip > Data/gencode.v3c.promoter_merged.bed.gz"

The print_utr.pl is adapted from http://davetang.org/muse/2013/01/18/defining-genomic-regions/.

cd ~/

bsub -q 16G -o stdout -e stderr "perl /work3/NRPB1219/print_utr.pl /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf | sortBed | gzip > Data/gencode.v3c.UTR.bed.gz"

Use bjobs to check the all jobs have completed and ls to see the required files were generated and placed in the "Data" folder.

ls -la Data/gencode.*.gz

# Console output

-rw------- 1 s00yao00 s00yao00 2097963 2014-12-17 10:14 Data/gencode.v3c.exon_merged.bed.gz
-rw------- 1 s00yao00 s00yao00  277528 2014-12-17 10:14 Data/gencode.v3c.intergenic.bed.gz
-rw------- 1 s00yao00 s00yao00 1905992 2014-12-17 10:18 Data/gencode.v3c.intron_merged.bed.gz
-rw------- 1 s00yao00 s00yao00  535447 2014-12-17 10:25 Data/gencode.v3c.promoter_merged.bed.gz
-rw------- 1 s00yao00 s00yao00 1478470 2014-12-17 10:23 Data/gencode.v3c.transcript.bed.gz
-rw------- 1 s00yao00 s00yao00 1792723 2014-12-17 10:29 Data/gencode.v3c.UTR.bed.gz

Last updated