In this section, we will use the reference annotation files retrieved from public databases to generate several genomic positional templates in BED format. We will use these files in the calculation of DNA methylation at specific regions, such as promoter, intergenic, CpG islands, etc.
500 Chromosomal Bins
We will use the makewindows
program of the bedtools utilities to subset the genome into 500 bins for each chromosome.
Copy cd ~/
bsub -q 16G -o stdout -e stderr "bedtools makewindows -g /work3/NRPB1219/chromInfo.txt -n 500 -s 1 | gzip > ~/Data/hg18.500bins.bed.gz"
CpG Islands and Transcriptional Factor Binding Sites (TFBS)
The CpG island and TFBS annotation files downloaded from the UCSC genome browser is not in BED format. Hence we use awk
to process them. We then pipe the output in the sortBed
program of the bedtools utilities to make sure the BED files were properly coordinate-sorted.
Copy cd ~/
awk -F $'\t' 'BEGIN { OFS=FS } { print $1,$2,$3,"CpG|"$6,$8*10,"+" }' /work3/NRPB1219/cpgIslandExt.txt | sortBed | gzip > Data/cpgIslandExt.bed.gz
awk -F $'\t' 'BEGIN { OFS=FS } { if($6 >= 500) print $2,$3,$4,$5,$6,"+" }' /work3/NRPB1219/wgEncodeRegTfbsClustered.txt | sortBed | gzip > Data/wgEncodeRegTfbsClustered.bed.gz
The above command read the content of cpgIslandExt.txt
:
Copy chr1 18598 19673 CpG: 116 1075 116 787 21.6 73.2 0.83
chr1 124987 125426 CpG: 30 439 30 295 13.7 67.2 0.64
chr1 317653 318092 CpG: 29 439 29 295 13.2 67.2 0.62
chr1 427014 428027 CpG: 84 1013 84 734 16.6 72.5 0.64
chr1 439136 440407 CpG: 99 1271 99 777 15.6 61.1 0.84
And produce cpgIslandExt.bed
in BED format:
Copy chr1 18598 19673 CpG|116 216 +
chr1 124987 125426 CpG|30 137 +
chr1 317653 318092 CpG|29 132 +
chr1 427014 428027 CpG|84 166 +
chr1 439136 440407 CpG|99 156 +
And also read the content of wgEncodeRegTfbsClustered.txt
:
Copy 585 chr1 51 504 ZBTB33 392 . 51 504 0 1 453 0 25
585 chr1 130 306 c-Jun 459 . 130 306 0 1 176 0 25
585 chr1 247 292 NFKB 1000 . 247 292 0 1 45 0 25
585 chr1 7125 8886 GR 99 . 7125 8886 0 1 1761 0 25
585 chr1 9914 10042 HEY1 44 . 9914 10042 0 1 128 0 25
Perform a filtering on the score (6th column) and produce wgEncodeRegTfbsClustered.bed
in BED format:
Copy chr1 247 292 NFKB 1000 +
chr1 81184 81397 Rad21 1000 +
chr1 227464 227809 Rad21 602 +
chr1 227505 227800 CTCF 1000 +
chr1 530491 530823 c-Myc 1000 +
Promoters, Exonic, Intronic, UTR and intergenic Regions
The gene annotation file we used here was curated by the GENCODE project (http://www.gencodegenes.org/ ). The GTF format is a standardized way to present gene annotation information of a genome (including gene, transcript, UTR, exon, start and stop codon).
head -n 15 /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf
Copy ##description: evidence-based annotation of the human genome (NCBI36), version 3c
##provider: GENCODE
##contact: fsk@sanger.ac.uk
##format: gtf
##date: 2009-10-01
chr1 HAVANA gene 1737 4275 . + . gene_id "ENSG00000223972"; transcript_id "ENSG00000223972"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1"; level 2; havana_gene "OTTHUMG00000000961";
chr1 ENSEMBL transcript 1737 4275 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1 ENSEMBL exon 1737 2090 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1 ENSEMBL exon 2476 2584 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1 ENSEMBL exon 3084 4275 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1 ENSEMBL CDS 4022 4249 . + 0 gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1 ENSEMBL start_codon 4022 4024 . + 0 gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1 ENSEMBL stop_codon 4250 4252 . + 0 gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1 ENSEMBL UTR 1737 2090 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
chr1 ENSEMBL UTR 2476 2584 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "RP11-34P13.1"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "RP11-34P13.1-201"; level 3; havana_gene "OTTHUMG00000000961";
Please refer to online documentation for more information about the GTF format:
By executing the commands below, we will process the GTF file and generate genomic regions such as promoter regions, exonic, intronic, intergenic regions in BED format. The sortBed
and mergeBed
programs of the bedtools utilities was used to ensure the output is coordinate-sorted and overlapping regions were merged.
Copy cd ~/
bsub -q 16G -o stdout -e stderr "awk -F $'\t' 'BEGIN { OFS=FS } { if(\$3 == \"exon\") print \$1,\$4-1,\$5 }' /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf | sortBed | mergeBed -i - | gzip > Data/gencode.v3c.exon_merged.bed.gz"
!!! Use bjobs
to check the job has completed, and ls
to make sure this file was created before proceed.
ls -la Data/gencode.v3c.exon_merged.bed.gz
Copy # Console output
-rw------- 1 s00yao00 s00yao00 2097963 2014-12-19 23:24 Data/gencode.v3c.exon_merged.bed.gz
Copy cd ~/
bsub -q 16G -o stdout -e stderr "awk -F $'\t' 'BEGIN { OFS=FS } { if(\$3 == \"gene\") print \$1,\$4-1,\$5 }' /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf | sortBed | subtractBed -a - -b Data/gencode.v3c.exon_merged.bed.gz | sortBed | mergeBed -i - | gzip > Data/gencode.v3c.intron_merged.bed.gz"
bsub -q16G -o stdout -e stderr "awk -F $'\t' 'BEGIN { OFS=FS } { if(\$3 == \"gene\") print \$1,\$4-1,\$5 }' /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf | sortBed | complementBed -i - -g /work3/NRPB1219/chromInfo.txt | gzip > Data/gencode.v3c.intergenic.bed.gz"
bsub -q 16G -o stdout -e stderr "awk -F $'\t' 'BEGIN { OFS=FS } { if(\$3 == \"transcript\") print \$1,\$4-1,\$5,\$7,\$9 }' /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf | sed 's/^\\([[:alnum:]]*\\)\\t\\([[:digit:]]*\\)\\t\\([[:digit:]]*\\)\\t\\([+|-]\\).*transcript_id \"\\([[:alnum:]]*\\)\".*/\\1\\t\\2\\t\\3\\t\\5\\t0\\t\\4/' | sortBed | gzip > Data/gencode.v3c.transcript.bed.gz"
!!! Use bjobs
to check the all jobs have completed, and ls
to make sure this file was created before proceed.
ls -la Data/gencode.v3c.transcript.bed.gz
Copy # Console output
-rw------- 1 s00yao00 s00yao00 1478470 2014-12-19 23:33 Data/gencode.v3c.transcript.bed.gz
The promoter region defined here is the transcription start site (TSS) upstream 1000 bp & TSS downstream 500 bp.
Copy cd ~/
bsub -q 16G -o stdout -e stderr "zcat Data/gencode.v3c.transcript.bed.gz | awk -F $'\t' 'BEGIN { OFS=FS } { if(\$6 == \"+\") { left = \$2-1000; right = \$2+500 } else { left = \$3-500; right = \$3+1000 } if(left < 0) left = 0; print \$1,left,right }' | sortBed | mergeBed -i - | gzip > Data/gencode.v3c.promoter_merged.bed.gz"
The print_utr.pl
is adapted from http://davetang.org/muse/2013/01/18/defining-genomic-regions/ .
Copy cd ~/
bsub -q 16G -o stdout -e stderr "perl /work3/NRPB1219/print_utr.pl /work3/NRPB1219/gencode.v3c.annotation.NCBI36.gtf | sortBed | gzip > Data/gencode.v3c.UTR.bed.gz"
Use bjobs
to check the all jobs have completed and ls
to see the required files were generated and placed in the "Data" folder.
ls -la Data/gencode.*.gz
Copy # Console output
-rw------- 1 s00yao00 s00yao00 2097963 2014-12-17 10:14 Data/gencode.v3c.exon_merged.bed.gz
-rw------- 1 s00yao00 s00yao00 277528 2014-12-17 10:14 Data/gencode.v3c.intergenic.bed.gz
-rw------- 1 s00yao00 s00yao00 1905992 2014-12-17 10:18 Data/gencode.v3c.intron_merged.bed.gz
-rw------- 1 s00yao00 s00yao00 535447 2014-12-17 10:25 Data/gencode.v3c.promoter_merged.bed.gz
-rw------- 1 s00yao00 s00yao00 1478470 2014-12-17 10:23 Data/gencode.v3c.transcript.bed.gz
-rw------- 1 s00yao00 s00yao00 1792723 2014-12-17 10:29 Data/gencode.v3c.UTR.bed.gz