From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline
Geraldine A. Van der Auwera
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
Mauricio O. Carneiro
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
Christopher Hartl
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
Ryan Poplin
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
Guillermo del Angel
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
Ami Levy-Moonshine
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
Tadeusz Jordan
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
Khalid Shakir
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
David Roazen
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
Joel Thibault
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
Eric Banks
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
Kiran V. Garimella
Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
David Altshuler
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
Stacey Gabriel
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
Mark A. DePristo
Genome Sequencing and Analysis Group, Broad Institute, Cambridge, Massachusetts
Abstract
This unit describes how to use BWA and the Genome Analysis Toolkit (GATK) to map genome sequencing data to a reference and produce high-quality variant calls that can be used in downstream analyses. The complete workflow includes the core next-generation sequencing (NGS) data-processing steps needed to make the raw data suitable for analysis by the GATK, as well as the key methods involved in variant discovery using the GATK. Curr. Protoc. Bioinform. 43:11.10.1-11.10.33. © 2013 by John Wiley & Sons, Inc.