What is the MAM-pipeline?

Monoallelic gene expression is typically initiated early in the development of an organism. Dysregulation of monoallelic gene expression has already been linked to several non-Mendelian inherited genetic disorders. In humans, monoallelic DNA-methylation (MAM) is deemed to be an important regulator of monoallelic gene expression, but only a few examples are known. One important reason is that current affordable, truly genome-wide methods to assess DNA-methylation are based on sequencing after enrichment. Here, we present a new methodology, the MAM-pipeline, based on classical population genetic theory, i.e. the Hardy-Weinberg theorem, that combines methylomic data from MethylCap-seq with associated SNP profiles to identify monoallelically methylated loci.

The Hardy-Weinberg theorem states that in a large random-mating population with no selection, mutation or migration, both the allele and genotype frequencies of a gene locus with two alleles are constant from generation to generation. Furthermore, there is a simple relationship between these allele and genotype frequencies: if the alleles are A and a with frequencies p and q (= 1-p), respectively, then at equilibrium the genotype frequencies of AA, Aa and aa are p², 2pq and q², respectively.
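As a small worked illustration of the relationship above, the expected genotype frequencies can be computed directly from the allele frequency p. This is a minimal sketch; the function name `hardy_weinberg_frequencies` and the example value p = 0.7 are illustrative assumptions and are not part of the MAM-pipeline scripts.

```python
def hardy_weinberg_frequencies(p: float) -> dict:
    """Expected genotype frequencies at a biallelic locus under
    Hardy-Weinberg equilibrium, given the frequency p of allele A."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency p must lie in [0, 1]")
    q = 1.0 - p  # frequency of allele a
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


# Hypothetical example value: allele A at frequency 0.7.
freqs = hardy_weinberg_frequencies(0.7)
for genotype, frequency in freqs.items():
    print(f"{genotype}: {frequency:.2f}")
```

The three frequencies always sum to 1, since p² + 2pq + q² = (p + q)².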
The rationale behind the proposed methodology is that biallelic DNA-methylation results in MethylCap-seq data that are in Hardy-Weinberg equilibrium for each locus, i.e. if SNPs are present for a locus, both homozygous and heterozygous subjects will be detected at a predictable rate. In the case of monoallelic methylation, however, only the methylated allele is captured, so heterozygous samples are no longer detected as such. This results in a measurable deviation from the Hardy-Weinberg equilibrium.
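The idea of measuring a heterozygote deficit can be sketched with a simple Monte Carlo test. This is not the statistical framework of the MAM-pipeline (which is described in the paper and implemented in the R scripts); it is a simplified illustration under stated assumptions: the allele frequency is estimated from the apparent genotype counts, and the number of heterozygotes under equilibrium is simulated as a binomial draw. The function name, the example counts and the default of 1,000 iterations are all hypothetical choices.

```python
import random


def het_deficit_p_value(n_AA, n_Aa, n_aa, iterations=1000, seed=0):
    """One-sided Monte Carlo p-value for observing at most n_Aa
    heterozygotes if the locus were in Hardy-Weinberg equilibrium.

    Simplified illustration only, not the authors' method."""
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("at least one sample is required")
    p = (2 * n_AA + n_Aa) / (2 * n)   # estimated frequency of allele A
    het_prob = 2 * p * (1 - p)        # expected heterozygote frequency (2pq)
    rng = random.Random(seed)
    hits = 0
    for _ in range(iterations):
        # Simulate the number of heterozygotes among n samples.
        simulated_het = sum(rng.random() < het_prob for _ in range(n))
        if simulated_het <= n_Aa:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (iterations + 1)


# Hypothetical counts: no apparent heterozygotes vs. counts close to equilibrium.
p_monoallelic = het_deficit_p_value(50, 0, 50)
p_biallelic = het_deficit_p_value(25, 50, 25)
print(p_monoallelic, p_biallelic)
```

A locus with far fewer apparent heterozygotes than 2pq predicts yields a small p-value, whereas counts close to the equilibrium proportions do not.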

The developed pipeline (Figure 1) first compares enrichment-based sequencing data of multiple samples to the public NCBI Single Nucleotide Polymorphism (SNP) archive (dbSNP) in order to screen the obtained non-duplicate, uniquely mappable sequence reads for SNPs. Only SNP loci with adequate coverage and allele frequency are retained, and the effect of sequencing errors is further reduced by comparing the chance of a sequencing error with the chance of detecting genuine SNPs. For each single SNP locus, the Hardy-Weinberg theorem is then applied to evaluate whether the observed frequency of samples in which both alleles are detected is lower than randomly expected.

Starting from MethylCap-seq data of a mixture of 334 Caucasian human samples and an FDR of 0.1, this methodology allowed the identification of 80 monoallelically methylated loci, which were found in promoter regions significantly more often than expected. Of these 80 loci, 25 have previously been linked to imprinting. Additional validation was done using 14 whole-genome bisulfite sequencing (WGBS) data sets of diverse origin as well as mRNA-seq data of 16 normal tissues.
Importantly, the developed approach can easily be applied to other enrichment-based sequencing technologies, such as the ChIP-seq-based identification of monoallelic histone modifications.

For a detailed description of the different steps of the pipeline, see the online version of the paper.

Below you can find links to download the preprocessed and mapped, dbSNP-searched, filtered and sequencing-error-corrected SNP data from the 334 human MethylCap-seq samples used (Additional_data.txt.gz). To protect the patients' privacy, the SNP data for each SNP locus are randomly ordered, so that the SNPs of a sample cannot be linked together. In addition, these data are also available per chromosome (chr*_iter1.txt.gz); these files can be used as input for the statistical framework (provided as R and Perl scripts). To perform the MAM analysis on the given data, run the following consecutive scripts (after unzipping the chr*_iter1.txt.gz files):

Figure 1: Overview of the MAM-pipeline
  1. monoallelic_iter1.R (= data-analytical framework with 1,000 iterations)
    1. input parameters
      • working directory (containing chr*_iter1.txt)
      • chromosome to analyze
      • number of cores used for analysis
    2. output
      • Text file containing the obtained p-values for each SNP locus after the first iteration (for example: p1_iter1.txt = obtained p-values for SNP loci of chromosome 1)
    3. command-line
      • $ Rscript monoallelic_iter1.R /workdir/ chr #cores
      • ## example: $ Rscript monoallelic_iter1.R /home/user/MAM/ 1 10
  2. filter_iter1.pl (= obtain SNP-loci with p-value <= 0.005 after performing monoallelic_iter1.R)
    1. input parameters
      • working directory (containing both chr*_iter1.txt and p*_iter1.txt)
      • chromosome to analyze
    2. output
      • Text file containing the preprocessed SNP loci for the analyzed chromosome with p-values <= 0.005 after the first iteration (for example: chr1_iter2.txt for chromosome 1)
    3. command-line
      • $ perl filter_iter1.pl /workdir/ chr
      • ## example: $ perl filter_iter1.pl /home/user/MAM/ 1
  3. monoallelic_iter2.R (= data-analytical framework with 1,000,000 iterations)
    1. input parameters
      • working directory (containing chr*_iter2.txt obtained with filter_iter1.pl)
      • chromosome to analyze
      • number of cores used for analysis
    2. output
      • Text file containing the obtained p-values for each SNP locus after the second iteration (for example: p1_iter2.txt = obtained p-values for SNP loci of chromosome 1)
    3. command-line
      • $ Rscript monoallelic_iter2.R /workdir/ chr #cores
      • ## example: $ Rscript monoallelic_iter2.R /home/user/MAM/ 1 10
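
The three steps above can be chained per chromosome with a small driver. The following is a minimal sketch, not part of the distributed scripts: the helper names `mam_commands` and `run_mam` are hypothetical, and it assumes that the three scripts are located in the current directory and that `Rscript` and `perl` are on the PATH. Only the script names and argument order shown in the command lines above are taken from this page.

```python
import shlex
import subprocess


def mam_commands(workdir, chromosome, cores):
    """Build the three consecutive MAM commands for one chromosome,
    using the argument order shown in the command lines above."""
    chrom = str(chromosome)
    return [
        ["Rscript", "monoallelic_iter1.R", workdir, chrom, str(cores)],
        ["perl", "filter_iter1.pl", workdir, chrom],
        ["Rscript", "monoallelic_iter2.R", workdir, chrom, str(cores)],
    ]


def run_mam(workdir, chromosome, cores, dry_run=True):
    """Print the commands (dry run) or execute them in order,
    stopping at the first command that fails."""
    for command in mam_commands(workdir, chromosome, cores):
        if dry_run:
            print(shlex.join(command))
        else:
            subprocess.run(command, check=True)


# Dry run for chromosome 1 with 10 cores, mirroring the examples above.
run_mam("/home/user/MAM/", 1, 10, dry_run=True)
```

With `dry_run=True` the commands are only printed, which makes it easy to check them before starting the long-running second step with 1,000,000 iterations.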

License

This program is free software: you can redistribute it and/or modify it under the terms of the
GNU General Public License as published by the Free Software Foundation, either version 3 of the License,
or (at your option) any later version.
You may obtain a copy of the License at http://www.gnu.org/licenses/.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

Copyright (C) 2014 S. Steyaert et al.