Annotation File Preparation – Defining Genomic Regions

In this section, we use the reference annotation files retrieved from public databases to generate several genomic positional templates in BED format. These files are later used to calculate DNA methylation at specific regions, such as promoters, intergenic regions and CpG islands.

500 Chromosomal Bins

We will use the makewindows program of the bedtools utilities to divide each chromosome of the genome into 500 bins.

cd ~/

bsub -q 16G -o stdout -e stderr "bedtools makewindows -g /work3/NRPB1219/chromInfo.txt -n 500 -s 1 | gzip > ~/Data/hg18.500bins.bed.gz"
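Once the job has finished, a quick sanity check is to count how many windows each chromosome received. The sketch below is only an illustration: count_bins is a hypothetical helper name, and the toy coordinates are made up so the example runs without the real file.

```shell
# Count the number of windows per chromosome in a BED stream read from stdin.
# count_bins is a hypothetical helper, not part of bedtools.
count_bins() {
  awk -F '\t' '{ n[$1]++ } END { for (c in n) print c "\t" n[c] }' | sort
}

# Toy demonstration with made-up coordinates (3 windows per chromosome).
printf 'chr1\t0\t10\nchr1\t10\t20\nchr1\t20\t30\nchr2\t0\t5\nchr2\t5\t10\nchr2\t10\t15\n' | count_bins
```

On the real output, the same helper could be run as zcat ~/Data/hg18.500bins.bed.gz | count_bins; each chromosome would be expected to report about 500 windows, and a very different count is worth investigating.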

CpG Islands and Transcription Factor Binding Sites (TFBS)

The CpG island and TFBS annotation files downloaded from the UCSC genome browser are not in BED format, so we use awk to convert them. We then pipe the output into the sortBed program of the bedtools utilities to make sure the BED files are properly coordinate-sorted.

cd ~/
awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2,$3,"CpG|"$6,$8*10,"+" }' /work3/NRPB1219/cpgIslandExt.txt | sortBed | gzip > Data/cpgIslandExt.bed.gz
awk -F $'\t' 'BEGIN { OFS=FS } { if($6 >= 500) print $2,$3,$4,$5,$6,"+" }' /work3/NRPB1219/wgEncodeRegTfbsClustered.txt | sortBed | gzip > Data/wgEncodeRegTfbsClustered.bed.gz

The first command reads the content of cpgIslandExt.txt:

chr1    18598   19673   CpG: 116        1075    116     787     21.6    73.2    0.83
chr1    124987  125426  CpG: 30 439     30      295     13.7    67.2    0.64
chr1    317653  318092  CpG: 29 439     29      295     13.2    67.2    0.62
chr1    427014  428027  CpG: 84 1013    84      734     16.6    72.5    0.64
chr1    439136  440407  CpG: 99 1271    99      777     15.6    61.1    0.84

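It produces cpgIslandExt.bed in BED format: the name becomes "CpG|" followed by the CpG count (6th column), the score is the 8th column multiplied by 10, and the strand is fixed to "+".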
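To make the column mapping concrete, the sketch below applies the same awk expression to the first two sample rows shown above. The wrapper name cpg_to_bed is hypothetical, the rows are assumed to be tab-separated as in the UCSC dump, and the sortBed and gzip steps are left out so the example runs with awk alone.

```shell
# Same awk expression as in the tutorial command, wrapped in a helper.
# cpg_to_bed is a hypothetical name; input is read from stdin.
cpg_to_bed() {
  awk -F '\t' 'BEGIN { OFS = FS } { print $1, $2, $3, "CpG|" $6, $8 * 10, "+" }'
}

# First two sample rows of cpgIslandExt.txt, tab-separated.
printf 'chr1\t18598\t19673\tCpG: 116\t1075\t116\t787\t21.6\t73.2\t0.83\n' | cpg_to_bed
printf 'chr1\t124987\t125426\tCpG: 30\t439\t30\t295\t13.7\t67.2\t0.64\n' | cpg_to_bed
# Prints (tab-separated):
# chr1  18598   19673   CpG|116  216  +
# chr1  124987  125426  CpG|30   137  +
```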

The second command reads the content of wgEncodeRegTfbsClustered.txt, keeps only the records whose score (6th column) is at least 500, and writes wgEncodeRegTfbsClustered.bed in BED format.
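The score filter can be tried on a few rows before running it on the full file. In the sketch below the three input rows are invented for illustration (they are not taken from wgEncodeRegTfbsClustered.txt), the helper name tfbs_to_bed is hypothetical, and the column layout (bin, chrom, start, end, name, score) is an assumption inferred from the fields the tutorial command prints.

```shell
# Same filter as the tutorial command: keep rows with score >= 500 and
# drop the leading column. tfbs_to_bed is a hypothetical helper name.
tfbs_to_bed() {
  awk -F '\t' 'BEGIN { OFS = FS } { if ($6 >= 500) print $2, $3, $4, $5, $6, "+" }'
}

# Made-up rows: only the first and third reach the threshold of 500.
printf '585\tchr1\t10000\t10500\tTF_A\t700\n585\tchr1\t20000\t20300\tTF_B\t450\n586\tchr1\t30000\t30400\tTF_C\t500\n' | tfbs_to_bed
# Prints (tab-separated):
# chr1  10000  10500  TF_A  700  +
# chr1  30000  30400  TF_C  500  +
```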

Promoters, Exonic, Intronic, UTR and Intergenic Regions

The gene annotation file used here was curated by the GENCODE project (http://www.gencodegenes.org/). GTF is a standardized format for presenting the gene annotation of a genome (including genes, transcripts, UTRs, exons, and start and stop codons).

head -n 15 /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf

Please refer to the online documentation for more information about the GTF format.

By executing the commands below, we process the GTF file and generate genomic regions such as promoter, exonic, intronic and intergenic regions in BED format. The sortBed and mergeBed programs of the bedtools utilities are used to ensure that the output is coordinate-sorted and that overlapping regions are merged.
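As a rough illustration of the GTF-to-BED step (a sketch, not necessarily the exact commands used in this tutorial), the snippet below extracts exon records and converts them to BED. GTF coordinates are 1-based and inclusive while BED is 0-based and half-open, so the start is decremented by one. The helper name gtf_exons_to_bed and the sample record are hypothetical.

```shell
# Convert exon records of a GTF stream (stdin) to BED6.
# gtf_exons_to_bed is a hypothetical helper name.
gtf_exons_to_bed() {
  awk -F '\t' 'BEGIN { OFS = FS }
    /^#/ { next }                       # skip header/comment lines
    $3 == "exon" {
      name = "."
      # pull the gene_id value out of the attribute column, if present
      if (match($9, /gene_id "[^"]+"/))
        name = substr($9, RSTART + 9, RLENGTH - 10)
      print $1, $4 - 1, $5, name, 0, $7 # GTF start is 1-based, BED is 0-based
    }'
}

# Made-up GTF record for demonstration.
printf 'chr1\tHAVANA\texon\t1001\t1200\t.\t+\t.\tgene_id "ENSG0001"; transcript_id "ENST0001";\n' | gtf_exons_to_bed
# Prints (tab-separated): chr1  1000  1200  ENSG0001  0  +
```

In a full pipeline the output of such a helper would typically be piped through sortBed and mergeBed, as described above, before compression.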

!!! Use bjobs to check that the job has completed, and ls to make sure this file was created before proceeding.

ls -la Data/gencode.v3c.exon_merged.bed.gz

!!! Use bjobs to check that all jobs have completed, and ls to make sure this file was created before proceeding.

ls -la Data/gencode.v3c.transcript.bed.gz

The promoter region is defined here as the interval from 1000 bp upstream to 500 bp downstream of the transcription start site (TSS).
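The promoter definition can be sketched in awk as follows. This is an illustration under stated assumptions rather than the tutorial's own command: the input is assumed to be a BED6 transcript file with the strand in column 6, the TSS is taken as the start for "+" transcripts and the end for "-" transcripts, and the helper name tss_promoter is hypothetical. The start is clamped at 0; the end is not clamped to the chromosome length, which would require the chromosome sizes.

```shell
# Derive strand-aware promoter intervals (TSS -1000 / +500) from BED6 input.
# tss_promoter is a hypothetical helper name.
tss_promoter() {
  awk -F '\t' -v up=1000 -v down=500 'BEGIN { OFS = FS }
    {
      if ($6 == "-") { start = $3 - down; end = $3 + up }   # TSS at the BED end
      else           { start = $2 - up;   end = $2 + down } # TSS at the BED start
      if (start < 0) start = 0                              # do not run off the chromosome start
      print $1, start, end, $4, $5, $6
    }'
}

# Made-up transcripts for demonstration.
printf 'chr1\t5000\t9000\ttxA\t0\t+\nchr1\t5000\t9000\ttxB\t0\t-\nchr2\t300\t2000\ttxC\t0\t+\n' | tss_promoter
# Prints (tab-separated):
# chr1  4000  5500   txA  0  +
# chr1  8500  10000  txB  0  -
# chr2  0     800    txC  0  +
```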

The print_utr.pl script is adapted from http://davetang.org/muse/2013/01/18/defining-genomic-regions/.

Use bjobs to check that all jobs have completed, and ls to confirm that the required files were generated and placed in the "Data" folder.

ls -la Data/gencode.*.gz
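Beyond listing the files, it can help to confirm that each one is non-empty and is a valid gzip archive, since a failed job may leave an empty or truncated file behind. The sketch below is one way to do that; check_gz is a hypothetical helper name, and the demonstration uses temporary files rather than the real outputs.

```shell
# Report whether each given file exists, is non-empty, and passes gzip -t.
# check_gz is a hypothetical helper; it returns non-zero if any file fails.
check_gz() {
  status=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "MISSING  $f"; status=1
    elif ! gzip -t "$f" 2>/dev/null; then
      echo "CORRUPT  $f"; status=1
    else
      echo "OK       $f"
    fi
  done
  return "$status"
}

# Demonstration on temporary files: one valid archive, one empty file.
tmp=$(mktemp -d)
printf 'chr1\t0\t10\n' | gzip > "$tmp/good.bed.gz"
: > "$tmp/empty.bed.gz"
check_gz "$tmp/good.bed.gz" "$tmp/empty.bed.gz" || echo "some files need attention"
rm -rf "$tmp"
```

For the files of this section, the helper could be called as check_gz Data/gencode.*.gz.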
