Annotation File Preparation – Defining Genomic Regions
In this section, we will use the reference annotation files retrieved from public databases to generate several genomic positional templates in BED format. We will use these files to calculate DNA methylation at specific regions, such as promoters, intergenic regions, and CpG islands.
500 Chromosomal Bins
We will use the makewindows program of the bedtools utilities to divide each chromosome of the genome into 500 bins.
cd ~/
bsub -q 16G -o stdout -e stderr "bedtools makewindows -g /work3/NRPB1219/chromInfo.txt -n 500 -s 1 | gzip > ~/Data/hg18.500bins.bed.gz"
CpG Islands and Transcription Factor Binding Sites (TFBS)
The CpG island and TFBS annotation files downloaded from the UCSC Genome Browser are not in BED format, so we use awk to convert them. We then pipe the output into the sortBed program of the bedtools utilities to make sure the BED files are properly coordinate-sorted.
cd ~/
awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2,$3,"CpG|"$6,$8*10,"+" }' /work3/NRPB1219/cpgIslandExt.txt | sortBed | gzip > Data/cpgIslandExt.bed.gz
awk -F $'\t' 'BEGIN { OFS=FS } { if($6 >= 500) print $2,$3,$4,$5,$6,"+" }' /work3/NRPB1219/wgEncodeRegTfbsClustered.txt | sortBed | gzip > Data/wgEncodeRegTfbsClustered.bed.gz
The first command reads the content of cpgIslandExt.txt:
chr1 18598 19673 CpG: 116 1075 116 787 21.6 73.2 0.83
chr1 124987 125426 CpG: 30 439 30 295 13.7 67.2 0.64
chr1 317653 318092 CpG: 29 439 29 295 13.2 67.2 0.62
chr1 427014 428027 CpG: 84 1013 84 734 16.6 72.5 0.64
chr1 439136 440407 CpG: 99 1271 99 777 15.6 61.1 0.84
It produces cpgIslandExt.bed in BED format.
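As a small check of the conversion, the sketch below runs the same awk program on the first two sample rows shown above. The temporary file and the variable names are illustrative and not part of the original workflow, and the sketch assumes that the name field (for example "CpG: 116") is a single tab-delimited column.

```shell
# Illustrative only: two sample rows of cpgIslandExt.txt, written tab-delimited.
tmp=$(mktemp)
printf 'chr1\t18598\t19673\tCpG: 116\t1075\t116\t787\t21.6\t73.2\t0.83\n' >  "$tmp"
printf 'chr1\t124987\t125426\tCpG: 30\t439\t30\t295\t13.7\t67.2\t0.64\n'  >> "$tmp"

# Same awk program as above: the name becomes "CpG|<6th column>",
# the score is the 8th column multiplied by 10, and the strand is set to "+".
bed=$(awk -F '\t' 'BEGIN { OFS=FS } { print $1,$2,$3,"CpG|"$6,$8*10,"+" }' "$tmp")
printf '%s\n' "$bed"
# chr1    18598   19673   CpG|116 216     +
# chr1    124987  125426  CpG|30  137     +
rm -f "$tmp"
```

The six output columns follow the BED layout of chromosome, start, end, name, score and strand.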
The second command reads the content of wgEncodeRegTfbsClustered.txt, keeps only the records whose score (6th column) is at least 500, and produces wgEncodeRegTfbsClustered.bed in BED format.
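The score filter can be sketched on a few made-up rows. The rows below are hypothetical (they are not taken from wgEncodeRegTfbsClustered.txt), and the column layout (an index column first, then chromosome, start, end, name and score) is an assumption inferred from the awk program above.

```shell
# Hypothetical input rows: index, chrom, start, end, name, score (tab-delimited).
tmp=$(mktemp)
printf '585\tchr1\t10000\t10300\tTF_A\t1000\n' >  "$tmp"
printf '585\tchr1\t20000\t20250\tTF_B\t320\n'  >> "$tmp"
printf '585\tchr1\t30000\t30400\tTF_C\t500\n'  >> "$tmp"

# Same awk program as above: keep rows with score >= 500 and drop the first column.
tfbs=$(awk -F '\t' 'BEGIN { OFS=FS } { if($6 >= 500) print $2,$3,$4,$5,$6,"+" }' "$tmp")
printf '%s\n' "$tfbs"
rm -f "$tmp"
```

In this sketch the TF_B row is dropped because its score is below 500, while TF_A and TF_C are kept.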
Promoters, Exonic, Intronic, UTR and intergenic Regions
The gene annotation file we used here was curated by the GENCODE project (http://www.gencodegenes.org/). The GTF format is a standardized way to present gene annotation information of a genome (including gene, transcript, UTR, exon, start and stop codon).
head -n 15 /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf
Please refer to online documentation for more information about the GTF format:
By executing the commands below, we will process the GTF file and generate genomic regions such as promoter, exonic, intronic and intergenic regions in BED format. The sortBed and mergeBed programs of the bedtools utilities are used to ensure that the output is coordinate-sorted and that overlapping regions are merged.
!!! Use bjobs to check that the job has completed, and ls to make sure this file was created before proceeding.
ls -la Data/gencode.v3c.exon_merged.bed.gz
!!! Use bjobs to check that all jobs have completed, and ls to make sure this file was created before proceeding.
ls -la Data/gencode.v3c.transcript.bed.gz
The promoter region defined here spans from 1000 bp upstream to 500 bp downstream of the transcription start site (TSS).
The print_utr.pl script is adapted from http://davetang.org/muse/2013/01/18/defining-genomic-regions/.
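A minimal sketch of this promoter definition, assuming transcripts are available as BED6 records, is shown here. The transcripts are made up, and the coordinate convention (TSS taken as the BED start on the + strand and the BED end on the - strand) is an assumption rather than the exact command used in this tutorial.

```shell
# Hypothetical transcripts in BED6 format (made-up coordinates).
tx=$(mktemp)
printf 'chr1\t5000\t9000\ttxA\t0\t+\n'   >  "$tx"
printf 'chr1\t20000\t26000\ttxB\t0\t-\n' >> "$tx"
printf 'chr1\t300\t2000\ttxC\t0\t+\n'    >> "$tx"

# Strand-aware window: 1000 bp upstream and 500 bp downstream of the TSS.
promoters=$(awk -F '\t' 'BEGIN { OFS=FS }
{
  if ($6 == "+") { s = $2 - 1000; e = $2 + 500 }   # TSS at the BED start
  else           { s = $3 - 500;  e = $3 + 1000 }  # TSS at the BED end
  if (s < 0) s = 0                                 # clamp at the chromosome start
  print $1, s, e, $4, $5, $6
}' "$tx")
printf '%s\n' "$promoters"
rm -f "$tx"
```

For txC the upstream window would begin before position 0, so the start is clamped to 0. A real run would also need to clamp the end to the chromosome length.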
Use bjobs to check that all jobs have completed, and ls to see that the required files were generated and placed in the "Data" folder.
ls -la Data/gencode.*.gz
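Beyond listing the files, a quick integrity check can catch jobs that were interrupted partway through. The sketch below uses a hypothetical helper, check_gz (not part of the original workflow), which reports whether each file exists, is non-empty, and passes gzip's built-in test (gzip -t).

```shell
# Hypothetical helper: report whether each given file exists, is non-empty,
# and passes gzip's integrity test.
check_gz() {
  status=0
  for f in "$@"; do
    if [ -s "$f" ] && gzip -t "$f" 2>/dev/null; then
      echo "OK      $f"
    else
      echo "PROBLEM $f"
      status=1
    fi
  done
  return "$status"
}

# Assumes the current directory contains the Data folder used above.
check_gz Data/gencode.*.gz || echo "Some files are missing or incomplete."
```

If no file matches the pattern, the unexpanded pattern itself is reported as a problem, which also flags the case where nothing was generated.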