LD Clumping Tutorial - Statistical and Genomic Epidemiology Laboratory (SGEL)

The following tutorial explains how to perform LD clumping using the PLINK program (Purcell et al 2007). The tutorial uses two genome-wide association study (GWAS) summary results files (dataset1 and dataset2) containing the five essential columns required by SECA (and iSECA) and publicly available reference genotype data.

First four rows of example dataset1:
SNP        EA  NEA  P       BETA
rs3131972  A   G    0.2799  0.12839
rs3131969  A   G    0.2813  0.12221
rs3131967  T   C    0.2794  0.13102

First four rows of example dataset2:
SNP        EA  NEA  P       BETA
rs3131972  A   G    0.5351  0.02955
rs3131969  A   G    0.2268  0.06485
rs3131967  C   T    0.1754 -0.07881

To prepare the files for SECA/iSECA upload, SNP subsets will be generated and extracted from the original dataset files.

SNP subsets can be extracted using the following awk command to match two unsorted files on a certain column/field:

awk ‘NR==FNR {FILE1[$1]=$0; next} \
($1 in FILE1) {print $0}’ \
FILE1 FILE2 \
> FILE2_MatchedToFILE1

The “$1” in “FILE1[$1]” specifies the field number to match on (i.e., SNP names) in FILE1.
The “$1” in “($1 in FILE1)” specifies the field number to match on (i.e., SNP names) in FILE2.
FILE1 contains the subset of SNPs you want to extract from FILE2.
The subset of SNPs extracted from FILE2 are written to FILE2_MatchedToFILE1.

Step 1. Obtain the overlap of SNPs between the two datasets

Assuming the SNP names are in field #1 in both your dataset1 and dataset2 files, to obtain the overlap execute the following awk command:

awk ‘NR==FNR {FILE1[$1]=$0; next} \
($1 in FILE1) {print $0}’ \
dataset2 dataset1 \
> dataset1_subset_overlapping_dataset2

NOTE: Older versions of PLINK (i.e., v1.07) may output SNPs from your clump input file (‘dataset1_subset_overlapping_dataset2′) that are not in the LD reference data—these SNPs have NOT been assessed for LD independence. New versions of PLINK (i.e., v1.90) do not have this issue. You should therefore use the new versions of PLINK (which also run considerably faster), or first extract the SNPs that are in the LD reference, for example:

awk ‘NR==FNR {FILE1[$1]=$0; next} \
($1 in FILE1) {print $0}’ \
dataset2 dataset1 \
> dataset1_subset_overlapping_dataset2.tmp

awk ‘NR==FNR {FILE1[$2]=$0; next} \
($1 in FILE1) {print $0}’ \
hapmap_CEU_r23a_filtered.bim \
dataset1_subset_overlapping_dataset2.tmp \
> dataset1_subset_overlapping_dataset2

Step 2. Identify a set of independent SNPs in dataset1 (overlapping dataset2) via linkage disequilibrium (LD) clumping (Purcell et al. 2007)

To extract a subset of SNPs in dataset1 you need to reference a set of genotype data in PLINK format in order to estimate the LD between SNP-SNP pairs. The LD metric is the squared correlation (r²) based on genotypic allele counts. PLINK-format HapMap 2 (release 23a), HapMap 3 (release 2) and 1000 Genomes Project (1kGP) data (1000G PhaseI 2012 v3 Updated Integrated Phase 1 Release) for CEU, CHB+JPT and YRI population samples may be downloaded via the Resources links below.

To perform p-value informed LD clumping on your ‘dataset1_subset_overlapping_dataset2’ file using the ‘hapmap_CEU_r23a_filtered’ reference data, issue the following command (requires PLINK to be installed as well as the ‘hapmap_CEU_r23a_filtered’ reference data and ‘dataset1_subset_overlapping_dataset2’ file in the work directory):

plink \
–noweb \
–bfile hapmap_CEU_r23a_filtered \
–clump dataset1_subset_overlapping_dataset2 \
–clump-field P \
–clump-p1 1 \
–clump-p2 1 \
–clump-r2 0.1 \
–clump-kb 1000 \
–out dataset1_subset_overlapping_dataset2_clump1

The default output from PLINK –clump analysis (‘dataset1_subset_overlapping_dataset2_clump1.clumped’) lists the independent (index) SNPs in field #3. We need to extract this column, which can be done using the following awk command:

awk ‘{ print $3 }’ \
dataset1_subset_overlapping_dataset2_clump1.clumped \
> dataset1_subset_overlapping_dataset2_clump1.SNPs

We also need extract the index SNP data from ‘dataset1_subset_overlapping_dataset2’. This can be done by matching on field #3 in FILE1 and field #1 in FILE2, using the following awk command:

awk ‘NR==FNR {FILE1[$3]=$0; next} \
($1 in FILE1) {print $0}’ \
dataset1_subset_overlapping_dataset2_clump1.clumped \
dataset1_subset_overlapping_dataset2 \
> dataset1_subset_overlapping_dataset2.clump1

The resulting file ‘dataset1_subset_overlapping_dataset2.clump1’, contains a subset of GWAS summary results from dataset1 for a set of independent (index) SNPs (r² > 0.1) for all pairs of SNPs within 1 Mb of each other.

Now perform p-value informed LD clumping on your ‘dataset1_subset_overlapping_dataset2.clump1’ file to ensure none of the index SNPs within 10 Mb of each other are in long-range LD (r² > 0.1):

plink \
–noweb \
–bfile hapmap_CEU_r23a_filtered \
–extract dataset1_subset_overlapping_dataset2_clump1.SNPs \
–clump dataset1_subset_overlapping_dataset2.clump1 \
–clump-field P \
–clump-p1 1 \
–clump-p2 1 \
–clump-r2 0.1 \
–clump-kb 10000 \
–out dataset1_subset_overlapping_dataset2_clump2

If you do not want to perform p-value informed LD clumping then you could repeat the above PLINK analysis using a version of the –clump input file (‘dataset1_subset_overlapping_dataset2’) in which all of the observed p-values have been replaced with “0.5”, for example, by using the following awk command:

awk ‘{print $1,0.5}’ \
dataset1_subset_overlapping_dataset2 \
> dataset1_subset_overlapping_dataset2.P0.5

NB: make sure to change the column name for the dummy p-values from “0.5” to “P”, using a text editor or simple command, for example:

sed -i ‘s/SNP 0.5/SNP P/g’ \
dataset1_subset_overlapping_dataset2.P0.5

Then run PLINK as above but using the dummied p-value file (‘dataset1_subset_overlapping_dataset2.P0.5’):

plink \
–noweb \
–bfile hapmap_CEU_r23a_filtered \
–clump dataset1_subset_overlapping_dataset2.P0.5 \
–clump-field P \
–clump-p1 1 \
–clump-p2 1 \
–clump-r2 0.1 \
–clump-kb 1000 \
–out dataset1_subset_overlapping_dataset2_clump1

awk ‘{ print $3 }’ \
dataset1_subset_overlapping_dataset2_clump1.clumped \
> dataset1_subset_overlapping_dataset2_clump1.SNPs

awk ‘NR==FNR {FILE1[$3]=$0; next} \
($1 in FILE1) {print $0}’ \
dataset1_subset_overlapping_dataset2_clump1.clumped \
dataset1_subset_overlapping_dataset2.P0.5 \
> dataset1_subset_overlapping_dataset2.P0.5.clump1

plink \
–noweb \
–bfile hapmap_CEU_r23a_filtered \
–extract dataset1_subset_overlapping_dataset2_clump1.SNPs \
–clump dataset1_subset_overlapping_dataset2.P0.5.clump1 \
–clump-field P \
–clump-p1 1 \
–clump-p2 1 \
–clump-r2 0.1 \
–clump-kb 10000 \
–out dataset1_subset_overlapping_dataset2_clump2

Tip: You can force specific SNPs to be included by giving them smaller p-values.

Step 3. Extract the set of independent SNPs

As mentioned above, the default output from PLINK –clump analysis (‘dataset1_subset_overlapping_dataset2_clump2.clumped’) lists the independent SNPs in field #3, while field #1 contains the SNP name in the ‘dataset1_subset_overlapping_dataset2’ GWAS summary file. Therefore we need to match on field #3 in FILE1 and field #1 in FILE2, which can be done using the following awk command:

awk ‘NR==FNR {FILE1[$3]=$0; next} \
($1 in FILE1) {print $0}’ \
dataset1_subset_overlapping_dataset2_clump2.clumped \
dataset1_subset_overlapping_dataset2 \
> dataset1.independent

The resulting file ‘dataset1.independent’, contains a subset of GWAS summary results from dataset1 for a set of independent SNPs.

Although iSECA does not require the SNPs to be independent in dataset2 (i.e., it will automatically extract the overlap between the uploaded files), you should also extract the independent SNPs from dataset2 to speed up the iSECA upload/analysis, which can be done using the following awk command:

awk ‘NR==FNR {FILE1[$3]=$0; next} \
($1 in FILE1) {print $0}’ \
dataset1_subset_overlapping_dataset2_clump2.clumped \
dataset2 \
> dataset2.independent

The resulting two files ‘dataset1.independent’ and ‘dataset2.independent’ can now be uploaded to iSECA for analysis…

Resources

PLINK (binary) format genotype data file downloads for HapMap 2, HapMap 3 and 1000 Genomes Project (1kGP) data:

1000G PhaseI v3 CEU (minimac names if no rsID) – duplicate variant names removed
1000G PhaseI v3 CEU (minimac names if no rsID)
1000G PhaseI v3 CHB+JPT (minimac names if no rsID)
1000G PhaseI v3 YRI (minimac names if no rsID)

HapMap3 r2 CEU
HapMap3 r2 CHB+JPT
HapMap3 r2 YRI

HapMap2 r23a CEU
HapMap2 r23a CHB+JPT
HapMap2 r23a YRI

For most users I suggest the 1000G downloads will suffice.

Please also download the ‘1000G PhaseI v3 Marker Details Table‘ for more information on the 1kGP markers. This table is a useful resource to obtain base-pair positions for SNP rsIDs as it contains the following fields {minimac_names, minimac_names_if_no_rsID, chr_b37, pos_b37, Intrpl_pos_b36, chr_b36, pos_b36} for 31,326,389 variants.

An R script (‘SECA_Ranalysis_HeatmapPerm.R’) allowing users to increase the number of replicates (default = 1000) for the permutation of heatmap p-values using the “independent_aligned_effects.txt” file generated by SECA or iSECA is provided in the ‘SECA_local_version.zip‘ download.

References

HapMap 3 (release 2) QC+ SNP genotype data formatted as PLINK files (hapmap3_r2_b36_fwd.qc.poly.tar.bz2) was downloaded from the Broad Institute’s HapMap 3 (now retired) webpage (http://www.broadinstitute.org/~debakker/hapmap3r2.html). HapMap 2 (release 23a) filtered (MAF > 0.01 and genotyping rate > 0.95) SNP genotype data (founders) formatted as PLINK files was downloaded from the PLINK resources webpage: http://pngu.mgh.harvard.edu/~purcell/plink/res.shtml. 1000 Genomes Project (1kGP) data (1000G PhaseI 2012 v3 Updated Integrated Phase 1 Release) was download from the MaCH download page and later converted to PLINK format. Note: only founder (unrelated) individuals are utilised in the LD estimation.

Nyholt DR (2014) SECA: SNP effect concordance analysis using genome-wide association summary results. Bioinformatics. 2014 Jul 15;30(14):2086-8. doi: 10.1093/bioinformatics/btu171.

Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC. PLINK: a toolset for whole-genome association and population-based linkage analysis. Am J Hum Genet. 2007 81(3):559-75.

R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.

iSECA’s SNP effect alignment was originally based on METAL (Willer et al 2010) source code (‘generic-metal-2011-03-25.tar.gz’). However, the more recent ‘gwama2.1_SECAalign’ program uses coded adapted from the GWAMA program (source files in gwama2.1_SECAalign_source.zip).

Willer CJ, Li Y, Abecasis GR. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics. 2010 26(17):2190-1.

Mägi R, Morris AP. GWAMA: software for genome-wide association meta-analysis. BMC Bioinformatics. 2010 28(11):288.

________________________________________________________

Page last updated June 2, 2023.