## Introduction

Genetic connectedness quantifies the extent of linkage between individuals across management units. The concept can be extended to measure the level of connectedness between training and testing sets in whole-genome prediction. However, no user-friendly software tool is available to compute a comprehensive list of connectedness statistics. Therefore, we developed GCA, an R package for genetic connectedness analysis, which uses pedigree and genomic data to measure the connectedness between individuals across units.

## Connectedness Statistics

The connectedness statistics implemented in this R package are built on two core functions: prediction error variance (PEV) and variance of unit effect estimates (VE). The PEV-derived metrics include prediction error variance of differences (PEVD), coefficient of determination (CD), and prediction error correlation (r). These PEV-derived metrics can be summarized at the unit level in three ways: as the average PEV of all pairwise differences between individuals across units, as the average PEV within and across units, or using a contrast vector. The VE-derived metrics include variance of differences in management unit effects (VED), coefficient of determination of VED (CDVED), and connectedness rating (CR). Three correction factors accounting for the number of fixed effects can be applied to each VE-derived metric: no correction (0), correction for one fixed effect (1), and correction for two or more fixed effects (2). The R code is integrated with C++ through the Rcpp package (Eddelbuettel and François 2011) to improve computational speed. We expect this R package to provide a comprehensive and effective tool for genetic connectedness analysis and whole-genome prediction. The full list of connectedness statistics implemented in this R package is given below. Details of these statistics can be found in Yu and Morota (2019).

- Prediction error variance (PEV)
  - Prediction error variance of difference (PEVD)
    - Individual average PEVD
    - Group average PEVD
    - Contrast PEVD
  - Coefficient of determination (CD)
    - Individual average CD
    - Group average CD
    - Contrast CD
  - Prediction error correlation (r)
    - Individual average r
    - Group average r
    - Contrast r
- Variance of unit effect estimates (VE)
  - Variance of differences in unit effects (VED)
    - VED0
    - VED1
    - VED2
  - Coefficient of determination of VED (CDVED)
    - CDVED0
    - CDVED1
    - CDVED2
  - Connectedness rating (CR)
    - CR0
    - CR1
    - CR2
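As a rough illustration of how the PEV-derived statistics relate to one another, the sketch below builds the mixed-model equations for a small made-up data set in base R and derives PEV, PEVD, CD, and r for a pair of individuals. It does not call the GCA package. The toy data, the relationship matrix `K`, and the helper `pair_stats` are assumptions made for illustration, and the formulas follow the standard textbook definitions rather than the package's internal code.

```r
# Toy example (hypothetical): 6 individuals in 2 management units, one record each
unit <- factor(c(1, 1, 1, 2, 2, 2))
X <- model.matrix(~ -1 + unit)   # incidence matrix of unit effects
n <- length(unit)
Z <- diag(n)                     # incidence matrix of individuals

# Hypothetical relationship matrix: individuals 3 and 4 are related across units
K <- diag(n)
K[3, 4] <- K[4, 3] <- 0.5

sigma2a <- 0.6                   # additive genetic variance
sigma2e <- 0.4                   # residual variance
lambda  <- sigma2e / sigma2a

# Coefficient matrix of the mixed-model equations and its inverse
LHS <- rbind(cbind(crossprod(X),    crossprod(X, Z)),
             cbind(crossprod(Z, X), crossprod(Z) + solve(K) * lambda))
LHSinv <- solve(LHS)

# PEV matrix: random-effect block of the inverse, scaled by sigma2e
p   <- ncol(X)
PEV <- LHSinv[(p + 1):(p + n), (p + 1):(p + n)] * sigma2e

# Pairwise statistics for individuals i and j
pair_stats <- function(i, j) {
  pevd <- PEV[i, i] + PEV[j, j] - 2 * PEV[i, j]
  cd   <- 1 - pevd / (sigma2a * (K[i, i] + K[j, j] - 2 * K[i, j]))
  r    <- PEV[i, j] / sqrt(PEV[i, i] * PEV[j, j])
  c(PEVD = pevd, CD = cd, r = r)
}

pair_stats(3, 4)   # related pair located in different units
pair_stats(1, 6)   # unrelated pair located in different units
```

The unit-level statistics in the list above summarize such individual-level quantities over the members of two units.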

## Application of the GCA package

### Installing the GCA package from GitHub

### Data preparation

The dataset `GCcattle` contains two objects, `cattle.pheno` and `cattle.W`, which hold phenotypic and marker information, respectively. The details can be obtained by typing `?GCcattle`. The dimensions of `cattle.pheno` and `cattle.W` are shown below.

`## [1] 2500 6`

`## [1] 2500 10000`

The heritability of the simulated phenotype was set to 0.6, with \(\sigma^2_a\) = 0.6 and \(\sigma^2_e\) = 0.4.

Below we store these variance components and construct two incidence matrices and a genomic relationship matrix. `X1` and `X2` contain one and two fixed effects, respectively.

```
sigma2a <- 0.6 # additive genetic variance used in the simulation
sigma2e <- 0.4 # residual variance used in the simulation
X1 <- model.matrix(~ -1 + factor(cattle.pheno$Unit)) # incidence matrix of unit effect with intercept excluded
X2 <- model.matrix(~ -1 + factor(cattle.pheno$Unit) + factor(cattle.pheno$Sex)) # incidence matrix of unit effect and sex
G <- computeG(cattle.W) # genomic relationship matrix
```

`## Genomic relationship matrix has been computed. Number of SNPs removed: 99`
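To give a sense of what building a genomic relationship matrix involves, here is a minimal base R sketch on simulated genotypes. It assumes the first method of VanRaden (2008) and a minor allele frequency filter of 0.05. The text above does not state which formula or filter `computeG` applies, so both choices, along with the simulated marker matrix `W`, are illustrative assumptions.

```r
set.seed(1)
n_ind <- 20    # hypothetical number of individuals
n_snp <- 200   # hypothetical number of markers

# Simulated genotypes coded as 0/1/2 copies of the reference allele
freq <- runif(n_snp, 0.1, 0.9)
W <- sapply(freq, function(p) rbinom(n_ind, size = 2, prob = p))

# Remove markers with low minor allele frequency (threshold is an assumption)
p_hat <- colMeans(W) / 2
maf   <- pmin(p_hat, 1 - p_hat)
keep  <- maf >= 0.05
W     <- W[, keep, drop = FALSE]
p_hat <- p_hat[keep]
cat("Number of SNPs removed:", sum(!keep), "\n")

# Center each marker by twice its allele frequency, then scale the cross-product
Wc <- sweep(W, 2, 2 * p_hat)
G  <- tcrossprod(Wc) / (2 * sum(p_hat * (1 - p_hat)))

dim(G)
```

Because each marker is centered with the sample allele frequency, every row of this `G` sums to zero up to rounding error.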

### Available connectedness statistics in the GCA package

The following section lists all connectedness statistics available in the GCA package. Each is selected through the `statistic` argument.

- `PEVD_IdAve`: Individual average PEVD. The optional argument `scale` is available.
- `PEVD_GrpAve`: Group average PEVD. The optional arguments `scale` and `diag` are available.
- `PEVD_contrast`: Contrast PEVD. The optional argument `scale` is available.
- `CD_IdAve`: Individual average CD.
- `CD_GrpAve`: Group average CD. The optional argument `diag` is available.
- `CD_contrast`: Contrast CD.
- `r_IdAve`: Individual average r.
- `r_GrpAve`: Group average r. The optional argument `diag` is available.
- `r_contrast`: Contrast r.
- `VED0`: Variance of estimated differences in unit effects. The optional argument `scale` is available.
- `VED1`: Variance of estimated differences in unit effects, corrected for the unit effect. The optional argument `scale` is available.
- `VED2`: Variance of estimated differences in unit effects, corrected for the unit effect and additional fixed effects. The additional argument `Uidx` is required and the optional argument `scale` is available.
- `CDVED0`: Coefficient of determination of VED. The optional argument `diag` is available.
- `CDVED1`: Coefficient of determination of VED, corrected for the unit effect. The optional argument `diag` is available.
- `CDVED2`: Coefficient of determination of VED, corrected for the unit effect and additional fixed effects. The additional argument `Uidx` is required and the optional argument `diag` is available.
- `CR0`: Connectedness rating.
- `CR1`: Connectedness rating, corrected for the unit effect.
- `CR2`: Connectedness rating, corrected for the unit effect and additional fixed effects. The additional argument `Uidx` is required.

### Examples of connectedness measures across units

Below we list some examples of connectedness statistics in the GCA package.

#### PEV-derived statistic: PEVD_IdAve (pairwise vs. overall)

The `gca` function is the main engine of the GCA package. The following example computes the pairwise individual average PEVD between all units. Based on the results, units 1 and 8 are the most connected (PEVD_IdAve = 0.5178) and units 4 and 6 are the least connected (PEVD_IdAve = 0.5525). The PEVD statistic ranges from 0 to 1, with smaller values indicating greater connectedness.

```
PEVD_IdAve <- gca(Kmatrix = G, Xmatrix = X1, sigma2a = sigma2a, sigma2e = sigma2e,
                  MUScenario = as.factor(cattle.pheno$Unit), statistic = 'PEVD_IdAve',
                  NumofMU = 'Pairwise')
round(PEVD_IdAve, digits = 4)
```

```
## U1 U2 U3 U4 U5 U6 U7 U8
## U1 NA 0.5198 0.5258 0.5288 0.5249 0.5299 0.5201 0.5178
## U2 0.5198 NA 0.5368 0.5410 0.5372 0.5425 0.5330 0.5303
## U3 0.5258 0.5368 NA 0.5456 0.5416 0.5476 0.5360 0.5358
## U4 0.5288 0.5410 0.5456 NA 0.5453 0.5525 0.5414 0.5386
## U5 0.5249 0.5372 0.5416 0.5453 NA 0.5460 0.5355 0.5347
## U6 0.5299 0.5425 0.5476 0.5525 0.5460 NA 0.5425 0.5405
## U7 0.5201 0.5330 0.5360 0.5414 0.5355 0.5425 NA 0.5315
## U8 0.5178 0.5303 0.5358 0.5386 0.5347 0.5405 0.5315 NA
```
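The three unit-level summaries of PEVD (individual average, group average, and contrast) can be sketched in base R from a PEV matrix. This is an illustration under stated assumptions rather than the package's implementation: the six-individual data set and relationship matrix are made up, diagonal elements are included in the group average, and no scaling is applied.

```r
# Toy setup (hypothetical): 6 individuals, 2 units, one record each
unit <- factor(c(1, 1, 1, 2, 2, 2))
X <- model.matrix(~ -1 + unit)
n <- length(unit)
K <- diag(n)
K[3, 4] <- K[4, 3] <- 0.5        # one related pair across units
sigma2a <- 0.6
sigma2e <- 0.4

# PEV matrix from the inverse of the mixed-model coefficient matrix (Z = I)
LHS <- rbind(cbind(crossprod(X), t(X)),
             cbind(X, diag(n) + solve(K) * sigma2e / sigma2a))
PEV <- solve(LHS)[-(1:2), -(1:2)] * sigma2e

i <- which(unit == 1)            # members of unit 1
j <- which(unit == 2)            # members of unit 2

# Individual average: mean PEVD over all cross-unit pairs of individuals
pevd_pairs <- outer(diag(PEV)[i], diag(PEV)[j], "+") - 2 * PEV[i, j]
PEVD_IdAve <- mean(pevd_pairs)

# Group average: mean PEV within each unit minus twice the mean PEV across units
PEVD_GrpAve <- mean(PEV[i, i]) + mean(PEV[j, j]) - 2 * mean(PEV[i, j])

# Contrast: weights 1/n_i for unit 1 and -1/n_j for unit 2
x <- ifelse(unit == 1, 1 / length(i), -1 / length(j))
PEVD_contrast <- drop(t(x) %*% PEV %*% x)

c(IdAve = PEVD_IdAve, GrpAve = PEVD_GrpAve, contrast = PEVD_contrast)
```

With diagonals included, the group average and the contrast give the same value in this sketch, because the contrast weights simply average each block of the PEV matrix.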

Alternatively, the `gca` function can return an overall connectedness measure, which averages all pairwise PEVD_IdAve values between units, by setting `NumofMU` to `'Overall'`.

```
PEVD_IdAve <- gca(Kmatrix = G, Xmatrix = X1, sigma2a = sigma2a, sigma2e = sigma2e,
                  MUScenario = as.factor(cattle.pheno$Unit), statistic = 'PEVD_IdAve',
                  NumofMU = 'Overall')
PEVD_IdAve
```

`## [1] 0.5358172`
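The relationship between the pairwise table and the overall value can be checked by hand. The sketch below re-enters the rounded upper-triangle values from the pairwise PEVD_IdAve output above and averages them. The variable names are chosen here for illustration, and the result only agrees with the overall value up to the four-digit rounding of the table.

```r
units <- paste0("U", 1:8)

# Upper-triangle values of the pairwise PEVD_IdAve table, read row by row
vals <- c(0.5198, 0.5258, 0.5288, 0.5249, 0.5299, 0.5201, 0.5178,
          0.5368, 0.5410, 0.5372, 0.5425, 0.5330, 0.5303,
          0.5456, 0.5416, 0.5476, 0.5360, 0.5358,
          0.5453, 0.5525, 0.5414, 0.5386,
          0.5460, 0.5355, 0.5347,
          0.5425, 0.5405,
          0.5315)

# The 28 unit pairs, generated in the same row-by-row order
unit_pairs <- t(combn(units, 2))

overall <- mean(vals)                       # close to the 'Overall' result
most    <- unit_pairs[which.min(vals), ]    # pair with the smallest PEVD
least   <- unit_pairs[which.max(vals), ]    # pair with the largest PEVD

round(overall, 4)
most
least
```

The same lookup works for statistics where larger values mean greater connectedness, such as CD, by swapping `which.min` and `which.max`.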

Following the above example, the group average PEVD and contrast PEVD can be calculated by changing the `statistic` argument to `'PEVD_GrpAve'` and `'PEVD_contrast'`, respectively.

#### PEV-derived statistic: CD_IdAve

The following example shows the pairwise individual average CD between units. Units 1 and 8 are the most connected (CD_IdAve = 0.7408), whereas units 4 and 6 are the least connected (CD_IdAve = 0.7230). Larger CD values indicate greater connectedness.

```
CD_IdAve <- gca(Kmatrix = G, Xmatrix = X1, sigma2a = sigma2a, sigma2e = sigma2e,
                MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CD_IdAve',
                NumofMU = 'Pairwise')
round(CD_IdAve, digits = 4)
```

```
## U1 U2 U3 U4 U5 U6 U7 U8
## U1 NA 0.7397 0.7375 0.7347 0.7370 0.7347 0.7394 0.7408
## U2 0.7397 NA 0.7324 0.7287 0.7308 0.7285 0.7336 0.7347
## U3 0.7375 0.7324 NA 0.7264 0.7290 0.7268 0.7316 0.7320
## U4 0.7347 0.7287 0.7264 NA 0.7261 0.7230 0.7291 0.7296
## U5 0.7370 0.7308 0.7290 0.7261 NA 0.7263 0.7313 0.7321
## U6 0.7347 0.7285 0.7268 0.7230 0.7263 NA 0.7282 0.7295
## U7 0.7394 0.7336 0.7316 0.7291 0.7313 0.7282 NA 0.7340
## U8 0.7408 0.7347 0.7320 0.7296 0.7321 0.7295 0.7340 NA
```

#### PEV-derived statistic: CD_GrpAve

The group average CD is computed in the following example. Units 1 and 7 are the most connected (CD_GrpAve = 0.7097), while units 1 and 6 are the least connected (CD_GrpAve = 0.6494).

```
CD_GrpAve <- gca(Kmatrix = G, Xmatrix = X1, sigma2a = sigma2a, sigma2e = sigma2e,
                 MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CD_GrpAve',
                 NumofMU = 'Pairwise')
round(CD_GrpAve, digits = 4)
```

```
## U1 U2 U3 U4 U5 U6 U7 U8
## U1 NA 0.6850 0.6679 0.6788 0.6747 0.6494 0.7097 0.6747
## U2 0.6850 NA 0.6938 0.6839 0.6774 0.6642 0.7090 0.6830
## U3 0.6679 0.6938 NA 0.6729 0.6738 0.6640 0.6953 0.6583
## U4 0.6788 0.6839 0.6729 NA 0.6795 0.6595 0.7052 0.6751
## U5 0.6747 0.6774 0.6738 0.6795 NA 0.6663 0.7020 0.6748
## U6 0.6494 0.6642 0.6640 0.6595 0.6663 NA 0.6774 0.6556
## U7 0.7097 0.7090 0.6953 0.7052 0.7020 0.6774 NA 0.6902
## U8 0.6747 0.6830 0.6583 0.6751 0.6748 0.6556 0.6902 NA
```

#### VE-derived statistic: VED0

The following example illustrates the VED0 statistic. Here, smaller values indicate greater connectedness. Units 1 and 8 are the most connected (VED0 = 0.0205), whereas units 4 and 6 are the least connected (VED0 = 0.0728).

```
VED0 <- gca(Kmatrix = G, Xmatrix = X1, sigma2a = sigma2a, sigma2e = sigma2e,
            MUScenario = as.factor(cattle.pheno$Unit), statistic = 'VED0',
            NumofMU = 'Pairwise')
round(VED0, digits = 4)
```

```
## U1 U2 U3 U4 U5 U6 U7 U8
## U1 NA 0.0213 0.0296 0.0367 0.0322 0.0373 0.0246 0.0205
## U2 0.0213 NA 0.0470 0.0553 0.0509 0.0563 0.0438 0.0394
## U3 0.0296 0.0470 NA 0.0623 0.0575 0.0638 0.0491 0.0472
## U4 0.0367 0.0553 0.0623 NA 0.0653 0.0728 0.0587 0.0541
## U5 0.0322 0.0509 0.0575 0.0653 NA 0.0656 0.0521 0.0495
## U6 0.0373 0.0563 0.0638 0.0728 0.0656 NA 0.0593 0.0555
## U7 0.0246 0.0438 0.0491 0.0587 0.0521 0.0593 NA 0.0435
## U8 0.0205 0.0394 0.0472 0.0541 0.0495 0.0555 0.0435 NA
```
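The idea behind the uncorrected VE-derived statistics can be sketched in base R. The example assumes that VE0 is the sampling variance of the estimated unit effects obtained after absorbing the random effects, that VED0 is the variance of the difference between two estimated unit effects, and that CR0 is the corresponding correlation. The nine-individual data set is made up, and the package's own code may differ in detail.

```r
# Toy setup (hypothetical): 9 individuals in 3 units, one record each
unit <- factor(rep(1:3, each = 3))
X <- model.matrix(~ -1 + unit)
n <- length(unit)
K <- diag(n)
K[3, 4] <- K[4, 3] <- 0.5        # a related pair linking units 1 and 2
sigma2a <- 0.6
sigma2e <- 0.4
lambda  <- sigma2e / sigma2a

# Var(b_hat) = [X'X - X'Z (Z'Z + K^-1 lambda)^-1 Z'X]^-1 sigma2e, with Z = I
M   <- diag(n) - solve(diag(n) + solve(K) * lambda)
VE0 <- solve(t(X) %*% M %*% X) * sigma2e

# VED0: variance of the difference between two estimated unit effects
v    <- diag(VE0)
VED0 <- outer(v, v, "+") - 2 * VE0

# CR0: correlation between two estimated unit effects
CR0 <- cov2cor(VE0)

diag(VED0) <- NA
diag(CR0)  <- NA
round(VED0, 4)
round(CR0, 4)
```

In this toy design only units 1 and 2 share related individuals, so their VED0 is smaller than that of units 1 and 3, which are of the same size but genetically unlinked.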

#### VE-derived statistic: CDVED1

The following example computes the CDVED1 statistic, which accounts for the number of observations in each unit. Units 1 and 7 are the most connected (CDVED1 = 0.7097), whereas units 1 and 6 are the least connected (CDVED1 = 0.6494).

```
CDVED1 <- gca(Kmatrix = G, Xmatrix = X1, sigma2a = sigma2a, sigma2e = sigma2e,
              MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CDVED1',
              NumofMU = 'Pairwise')
round(CDVED1, digits = 4)
```

```
## U1 U2 U3 U4 U5 U6 U7 U8
## U1 NA 0.6850 0.6679 0.6788 0.6747 0.6494 0.7097 0.6747
## U2 0.6850 NA 0.6938 0.6839 0.6774 0.6642 0.7090 0.6830
## U3 0.6679 0.6938 NA 0.6729 0.6738 0.6640 0.6953 0.6583
## U4 0.6788 0.6839 0.6729 NA 0.6795 0.6595 0.7052 0.6751
## U5 0.6747 0.6774 0.6738 0.6795 NA 0.6663 0.7020 0.6748
## U6 0.6494 0.6642 0.6640 0.6595 0.6663 NA 0.6774 0.6556
## U7 0.7097 0.7090 0.6953 0.7052 0.7020 0.6774 NA 0.6902
## U8 0.6747 0.6830 0.6583 0.6751 0.6748 0.6556 0.6902 NA
```

#### VE-derived statistic: CDVED2

This example shows the calculation of CDVED2, where two fixed effects (unit and sex) are accounted for by the correction factor. Larger CDVED2 values indicate greater connectedness. Units 1 and 7 are the most connected (CDVED2 = 0.7096), while units 1 and 6 are the least connected (CDVED2 = 0.6494).

```
CDVED2 <- gca(Kmatrix = G, Xmatrix = X2, sigma2a = sigma2a, sigma2e = sigma2e,
              MUScenario = as.factor(cattle.pheno$Unit), statistic = 'CDVED2',
              NumofMU = 'Pairwise', Uidx = 8)
round(CDVED2, digits = 4)
```

```
## U1 U2 U3 U4 U5 U6 U7 U8
## U1 NA 0.6849 0.6675 0.6788 0.6747 0.6494 0.7096 0.6747
## U2 0.6849 NA 0.6934 0.6838 0.6774 0.6642 0.7088 0.6829
## U3 0.6675 0.6934 NA 0.6728 0.6736 0.6639 0.6953 0.6582
## U4 0.6788 0.6838 0.6728 NA 0.6795 0.6595 0.7052 0.6751
## U5 0.6747 0.6774 0.6736 0.6795 NA 0.6663 0.7020 0.6748
## U6 0.6494 0.6642 0.6639 0.6595 0.6663 NA 0.6773 0.6556
## U7 0.7096 0.7088 0.6953 0.7052 0.7020 0.6773 NA 0.6901
## U8 0.6747 0.6829 0.6582 0.6751 0.6748 0.6556 0.6901 NA
```
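The call above passes `Uidx = 8` together with the two-effect incidence matrix. The base R sketch below shows how such a matrix is laid out on a made-up data frame, assuming that `Uidx` denotes the index of the last unit-effect column of the incidence matrix. That reading is an assumption based on the cattle example having eight units, and the data frame `pheno` is hypothetical.

```r
# Hypothetical records: 3 units and 2 sexes
pheno <- data.frame(Unit = c(1, 1, 2, 2, 3, 3),
                    Sex  = c("F", "M", "F", "M", "M", "F"))

# Unit effects only: one column per unit, intercept excluded
X1 <- model.matrix(~ -1 + factor(Unit), data = pheno)

# Unit and sex: unit columns come first, followed by one sex contrast column
X2 <- model.matrix(~ -1 + factor(Unit) + factor(Sex), data = pheno)

colnames(X2)

# Index of the last unit column (assumed meaning of Uidx)
Uidx <- nlevels(factor(pheno$Unit))
Uidx
```

With the intercept removed, R codes the first factor with one column per level and the second factor with one column fewer than its number of levels, so the unit columns always occupy positions 1 to `Uidx`.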

## References

Eddelbuettel, Dirk, and Romain François. 2011. “Rcpp: Seamless R and C++ Integration.” *Journal of Statistical Software* 40 (8): 1–18. https://doi.org/10.18637/jss.v040.i08.

Yu, Haipeng, and Gota Morota. 2019. “GCA: An R Package for Genetic Connectedness Analysis Using Pedigree and Genomic Data.” *bioRxiv*. https://doi.org/10.1101/696419.