DNA research has surged in popularity in recent years. Test takers want to learn about their health risks, see ethnicity estimates, or find cousins in an attempt to verify their genealogical paper trail. Many who research their ancestry want to know as much as possible about all their ancestors, but overlook the limits of DNA testing. Because autosomal DNA inheritance is random, people do not inherit DNA from all of their ancestors, which makes verifying certain lines impossible. In this short blog post I will share some details about sharing DNA with direct ancestors, in an attempt to make statistics on DNA inheritance more readily available.
Whereas the Shared cM project has managed to make knowledge about shared cM amongst close cousin relationships easily accessible through collecting self-reported data from the crowd, a practical challenge prevents us from doing so when the goal is to gather information on direct ancestors instead. While many person-grandparent and some person-greatgrandparent relationships are available in the data of the Shared cM project, gathering data on more distant relationships is impossible because most of these distant ancestors were never able to do a DNA test. To overcome this problem, I wrote a piece of software that can simulate DNA inheritance between cousins and ancestors. Using knowledge about the way DNA recombines, this has enabled me to generate the distributions of shared DNA across generations.
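To give an idea of how such a simulation can work, here is a minimal Python sketch. It is not the software used for this post. It rests on assumptions of my own choosing: a hypothetical genome of 22 equal-length chromosomes that add up to 3580.24 cM, on average one crossover per 100 cM without interference, no difference between male and female recombination rates, no minimum segment size, and no shared ancestry between the two sides of the family. All function names are made up for illustration.

```python
import random

# Hypothetical genome: 22 equal-length chromosomes summing to 3580.24 cM,
# the amount shared with one parent (real chromosomes differ in length).
TOTAL_CM = 3580.24
CHROM_LENGTHS_CM = [TOTAL_CM / 22] * 22


def crossover_points(length_cm, rng):
    """Crossover positions, assuming on average one crossover per 100 cM."""
    points = []
    pos = rng.expovariate(1 / 100)  # mean gap of 100 cM between crossovers
    while pos < length_cm:
        points.append(pos)
        pos += rng.expovariate(1 / 100)
    return points


def transmit(segments, length_cm, rng):
    """Pass one chromosome on to a child.

    `segments` are (start, end) stretches of the ancestor's DNA on one of the
    parent's two homologs; the other homolog is assumed to carry none of it.
    """
    bounds = [0.0] + crossover_points(length_cm, rng) + [length_cm]
    on_ancestor_homolog = rng.random() < 0.5  # homolog the copy starts on
    kept = []
    for left, right in zip(bounds, bounds[1:]):
        if on_ancestor_homolog:
            for start, end in segments:
                lo, hi = max(start, left), min(end, right)
                if lo < hi:
                    kept.append((lo, hi))
        on_ancestor_homolog = not on_ancestor_homolog  # switch at each crossover
    return kept


def shared_cm(generations, rng):
    """cM shared with one ancestor `generations` steps up (1 = parent)."""
    total = 0.0
    for length_cm in CHROM_LENGTHS_CM:
        segments = [(0.0, length_cm)]
        for _ in range(generations - 1):
            segments = transmit(segments, length_cm, rng)
        total += sum(end - start for start, end in segments)
    return total


rng = random.Random(42)
grandparent_samples = [shared_cm(2, rng) for _ in range(2000)]
print(sum(grandparent_samples) / len(grandparent_samples))
```

Calling `shared_cm(3, rng)` or `shared_cm(5, rng)` many times gives a rough distribution for greatgrandparents or 3x-greatgrandparents. Because of the simplifications, the exact percentiles will differ from the ones reported in this post.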
In the following paragraphs I will show the distributions of shared DNA for several generations, and briefly discuss their implications and how they compare with Shared cM project data.
Distributions & Statistics
A fair amount of data on shared cM between people and their grandparents (1106 samples) is available in the 4th version of the Shared cM project. This makes it the ideal sample to test whether my simulation actually works. The first step, of course, was to generate my virtual grandparents and their virtual grandchildren. Over 250,000 simulations later, the shared DNA distribution of the 1,000,000 simulated relationships that appeared was the following:
Examining the distribution, the first thing I noticed is that the number of cM grandchildren and grandparents share appears to follow a normal distribution. This is not strange: a person inherits 50% of their DNA from each parent, which means that each pair of grandparents together contributes 50% of that person's DNA. If one grandparent contributes 100 cM more, the other grandparent of that pair will have to contribute 100 cM less.
The median shared is 1790.12 cM, 95% CI [1789.32, 1790.90], which is exactly one quarter of a person's total cM (I took FamilyTreeDNA's figure of 3580.24 cM shared with each parent, so 7160.48 cM in total). This number is slightly higher than the mean reported in the Shared cM project (1754 cM), but this can be explained by the fact that companies' software does not pick up every segment. Due to random mutations, missing measurements and conservative estimates, the estimated number of shared cM might be lower than is actually the case. Similar issues arise with the lowest and highest shared cM values. The distribution can be examined more closely in the following table, showing both data from the Shared cM project and results from my model.
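The trade-off within a pair of grandparents can be written down in a few lines of Python. This is only an illustration of the arithmetic: the function name is hypothetical, and 3580.24 cM is the per-parent figure used in this post.

```python
PARENT_CM = 3580.24  # cM shared with one parent, the figure used in this post


def partner_share(grandparent_cm, parent_cm=PARENT_CM):
    """cM contributed by the other grandparent on the same side of the family."""
    return parent_cm - grandparent_cm


even_split = PARENT_CM / 2
# A grandparent who contributes 100 cM more than an even split ...
more = even_split + 100
# ... leaves the partner with 100 cM less than an even split.
print(more, partner_share(more))
```

Because every surplus on one side is mirrored by a deficit on the other, the distribution is symmetric around the even split.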
|Statistic|Shared cM project (cM)|Simulation (cM)|95% CI (cM)|
|---|---|---|---|
|Lowest shared cM value|984|||
|0.50% of the distribution||1014.15|1011.38, 1017.04|
|2.50% of the distribution||1194.95|1193.39, 1196.71|
|25.00% of the distribution||1583.44|1582.56, 1584.29|
|Average shared cM value|1754|||
|50.00% of the distribution||1790.12|1789.32, 1790.90|
|75.00% of the distribution||1996.80|1996.09, 1997.60|
|97.50% of the distribution||2385.28|2383.84, 2386.87|
|Largest shared cM value|2462|||
|99.50% of the distribution||2566.08|2563.07, 2568.95|
Less data is available on the generation above: the greatgrandparents ([x] samples). It makes me wonder what I will be able to uncover for this and the following generations.
Unlike the previous distribution, this one has a slight right skew. This means that within the range of observed relationships, the median shared DNA is lower than the average shared DNA. As such, more people will share a smaller amount of DNA than average, whereas there is a long “tail” of people who share larger amounts of DNA than one might expect based on the average amount of DNA shared. The generated distribution can be examined in more detail in the table below.
|Statistic|Shared cM project (cM)|Simulation (cM)|95% CI (cM)|
|---|---|---|---|
|0.50% of the distribution||343.42|341.59, 344.66|
|2.50% of the distribution||455.84|454.95, 457.09|
|Lowest shared cM value|485|||
|25.00% of the distribution||726.37|725.70, 726.89|
|50.00% of the distribution||884.63|884.09, 885.22|
|Average shared cM value|887|||
|75.00% of the distribution||1052.37|1051.61, 1053.02|
|97.50% of the distribution||1394.45|1392.78, 1395.84|
|Largest shared cM value|1486|||
|99.50% of the distribution||1567.85|1565.13, 1570.62|
In contrast to earlier relationships, no distributions of shared DNA with 2x-greatgrandparents are readily available. It thus becomes impossible to compare the generated distribution with real-life data.
The distribution of shared cM has become more skewed than the distributions mentioned before. On top of that, 4 of the 1 million simulated relationships turned out to not share any DNA at all. With consumer DNA tests, even more ancestors might not show up, as the smallest segment size that is counted is usually limited to ~7 cM.
|Statistic|Simulation (cM)|95% CI (cM)|
|---|---|---|
|0.50% of the distribution|99.48|98.48, 100.23|
|2.50% of the distribution|160.89|160.36, 161.50|
|25.00% of the distribution|327.74|327.30, 328.03|
|50.00% of the distribution|434.45|434.07, 434.80|
|75.00% of the distribution|552.88|552.50, 553.29|
|97.50% of the distribution|809.46|808.57, 810.48|
|99.50% of the distribution|945.62|943.58, 947.70|
So, what about greatgreatgreatgrandparents? What is the likelihood of these ancestors not sharing any DNA with you? What does the distribution look like now?
The percentage of cases in which someone shares no DNA with a given 3x-greatgrandparent is even larger: 609 of 1 million simulated relationships (~0.06%) had 0 cM shared DNA. Given that a person has 32 3x-greatgrandparents, the majority of people will still share DNA with all ancestors within this generation, but this will no longer be the case as we dive further into history in a next post.
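A rough back-of-the-envelope check of that claim, as a short Python sketch. It assumes that the 32 ancestors can be treated as independent draws, which is only approximately true, and it uses the simulated counts from this post; the variable names are my own.

```python
zero_share_cases = 609       # simulated pairs with 0 cM shared, from this post
simulated_pairs = 1_000_000  # total simulated relationships
ancestors = 32               # number of 3x-greatgrandparents

p_zero = zero_share_cases / simulated_pairs

# Assumption: the 32 ancestors are treated as independent, which is a
# simplification, so the result is an approximation only.
p_all_shared = (1 - p_zero) ** ancestors
p_at_least_one_missing = 1 - p_all_shared

print(f"share DNA with all 32: {p_all_shared:.1%}")
print(f"at least one with 0 cM: {p_at_least_one_missing:.1%}")
```

Under this assumption about 98% of people would share DNA with every one of their 3x-greatgrandparents, before taking the minimum segment size of consumer tests into account.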
|Statistic|Simulation (cM)|95% CI (cM)|
|---|---|---|
|0.50% of the distribution|17.44|17.07, 17.82|
|2.50% of the distribution|46.23|45.88, 46.53|
|25.00% of the distribution|142.29|142.03, 142.56|
|50.00% of the distribution|210.98|210.69, 211.22|
|75.00% of the distribution|291.04|290.73, 291.37|
|97.50% of the distribution|475.38|474.48, 476.19|
|99.50% of the distribution|576.62|575.28, 578.09|
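The growing skew can also be seen by comparing the simulated medians from the tables above with a simple halving rule. The sketch below assumes that the average amount shared halves with every generation, starting from the 3580.24 cM shared with a parent; that expectation is my own simplification, while the medians are copied from the tables in this post.

```python
PARENT_CM = 3580.24  # cM shared with one parent, the figure used in this post

# Simulated medians from the tables in this post, keyed by the number of
# generations between the person and the ancestor (2 = grandparent).
simulated_median_cm = {2: 1790.12, 3: 884.63, 4: 434.45, 5: 210.98}

# Assumption: the expected (average) share halves with every generation.
expected_mean_cm = {
    generations: PARENT_CM / 2 ** (generations - 1)
    for generations in simulated_median_cm
}

for generations, median in simulated_median_cm.items():
    expected = expected_mean_cm[generations]
    ratio = median / expected
    print(f"{generations} generations: expected {expected:.2f} cM, "
          f"median {median:.2f} cM, ratio {ratio:.3f}")
```

For grandparents the median and the halving expectation coincide. For each generation after that the median falls a little further below the expectation, which is what a right-skewed distribution looks like.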
When taking a DNA test it is important to understand that you will not have inherited the same amount of DNA from every ancestor. You will inherit more DNA from some ancestors than others, and by generation 6 there is already a realistic chance that you have not inherited any DNA at all from a given ancestor. The further up an ancestor is in your tree, the less likely it is that you will be a DNA match.
However, because of the skewed distributions above, the opposite can also happen: in some cases you will share large amounts of DNA on certain lines. On these lines you might, for instance, discover DNA matches that go much further back in history than on other lines. As DNA inheritance is random, testing older family members and 2nd or 3rd cousins can help you find many more relationships! All in all, any conclusions (cousin matches, ethnicity estimates) drawn from your DNA are only as accurate as the DNA you have.
The top photo is a cut-out from a photo made by Fotopersbureau het Zuiden in 1948, now in the collection of the Brabants Historisch Informatie Centrum (identification number 1634-005713).