%
% MANUSCRIPT STARTS HERE
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\documentclass[]{MathAppl18}
%Polish letter encoding
\usepackage[OT4,T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{polski}
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{filecontents}
% Cyrylica
%\input cyracc.def
%\newcommand\cyrfamily{\fontencoding{OT2}\fontfamily{wncyr}
%\selectfont\cyracc}
%\DeclareTextFontCommand{\textcyr}{\cyrfamily}
%\volume{46}
\usepackage[english,polish]{babel} %PS
\usepackage{lastpage} %PS
\usepackage{etex} %PS
\usepackage{color}
\usepackage{cite} %PS
%\usepackage{here} %PS
\RequirePackage[numbers]{natbib}
\renewcommand{\bibsection}{}
%\usepackage[cam,a4,center]{crop}suppress this lines.
\usepackage[colorlinks=true]{hyperref}
%\usepackage{hyperref}
\hypersetup{
pdftitle={Template Mathematica Applicanda}, %%<-- replace this
pdfauthor=Nowy Autor, %%<-- replace this
colorlinks,
urlcolor=blue,
filecolor=magenta,
citecolor=green,
linkbordercolor={1 1 1}, % set to white
citebordercolor={1 1 1}, % set to white
urlbordercolor={ 1 1 1} % set to white
}
\RequirePackage[hyperpageref]{backref}
\renewcommand*{\backref}[1]{}
\renewcommand*{\backrefalt}[4]{
\ifcase #1
Not cited.
\or
Cited on p. #2.
\else
Cited on pp. #2.
\fi}
\newcommand{\orcid}[1]{\href{https://orcid.org/#1}{\includegraphics[scale=.05]{orcid.png}}}
\newcommand{\orcidcode}[1]{\href{https://orcid.org/#1}{#1}}
\newcommand{\orcidcodeLINK}[1]{ORCID \href{https://orcid.org/#1}{https://orcid.org/#1}}
\def\repo{http://wydawnictwa.ptm.org.pl/index.php/matematyka-stosowana/article/viewArticle}
\newcommand*{\eudml}[1]{\href{http://eudml.org/doc/#1}{#1}}
%
%
%\newcommand*{\doi}[1]{\href{http://dx.doi.org/#1}{doi: #1}}
%\newcommand*{\MR}[1]{\href{http://www.ams.org/mathscinet-getitem?mr=#1&return=pdf}{#1}}
%\newcommand*{\ZBL}[1]{\href{http://www.zentralblatt-math.org/zmath/en/advanced/?q=an:#1&format=complete}{Zbl #1}}
%\newcommand*{\ZBLid}[1]{\href{https://zbmath.org/?q=ai:#1}{ZBLid:#1}} %%<-definicja KSz
%\newcommand*{\JFM}[1]{\href{http://www.zentralblatt-math.org/zmath/en/advanced/?q=an:#1&format=complete}{JFM #1}}
%\newcommand*{\eLIBru}[1]{\href{https://elibrary.ru/item.asp?id=#1}{eLibrary.ru #1}}
\pdfoutput=1
\pdfcompresslevel=0
\usepackage{graphicx}
\usepackage{wrapfig}
\usepackage{subfigure}
\graphicspath{{../Figures/},{./Figures/},{./Pictures/}}
\usepackage{multicol}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Submission information - GLOBAL %%
%% Inserted by editor %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\firstpage{i}
%www
\LogoG{\includegraphics[width=0.18\textwidth]{ma.png}}
\volume{46}
\fasc{2}
\years{2018}
%MS
\LogoGMS{\includegraphics[width=0.18\textwidth]{ms.png}}
\volumeMS{26}
\numberMS{63}
%\wwwfalse
\wwwtrue
% LOCAL DEFINITIONS
%+==============================
%\newtheorem{thm}{Theorem}[section]
%\newtheorem{cor}[thm]{Corollary}
%\newtheorem{lem}[thm]{Lemma}
%\newtheorem{prob}[thm]{Problem}
%\newtheorem{ass}[thm]{Assumption}
%% A numbered theorem with a fancy name:
%\newtheorem{mainthm}[theorem]{Main Theorem}
%% Numbered objects of "non-theorem" style (text roman):
%\theoremstyle{definition}
%\newtheorem{defin}[theorem]{Definition}
%\newtheorem{rem}[theorem]{Remark}
%\newtheorem{exa}[theorem]{Example}
%+====================================================
%\tpauthortrue % PS11lis12
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Submission information %%
%% Inserted by editor %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\secnameMS{Applied Probability}
\pages{193--206}
%\receivedPL[3 wrze?nia 2014]{ 26 listopada 2012}
\received[5th of June 2017]{ 11th of January 2017}
\lastrevision{}
\logo{}{}{}%Do not change 9 above lines
\def\doinum{10.14708/ma.v43i1.xxx}%% http://wydawnictwa.ptm.org.pl/index.php/matematyka-stosowana/article/view/614
%Do not change 9 above lines
%%Preliminaries
%%Preliminaries
\title[MathAppl KZMBM Template]{Template for Mathematica Applicanda
}
%\dedicated{Dedicated to ....}
%\dedicated{Dedicated to ....}
%\tpauthorfalse % PS11lis12
\tpauthortrue % PS11lis12
%%The First author
\author[N. Author]{Nowy Author\orcid{0000-0002-9584-4083}}%(Albuquerque)
%\thanks{This research was partially supported by is based on the invited lecture at International Conference on Topology and Applications held in August 23--27, 1999, at Kanagawa University in Yokohama, Japan}
\affiliation{Wroc{\l}aw University of Technology}
\address{Department of Mathematics and Computer Science\\
\indent Wroc{\l}aw University of Technology, Wybrzeże Wyspiańskiego 27, Wrocław 50-370\\
\indent \orcidcodeLINK{0000-0002-9584-4083}}
\email{Nowy.Author@pwr.wroc.pl}
\city{Wroc{\l}aw}
%\urladdr{www.a.b.c/\php first}
\author[N. Author]{Nowy Author}%(Albuquerque)
%\thanks{This research was partially supported by is based on the invited lecture at International Conference on Topology and Applications held in August 23--27, 1999, at Kanagawa University in Yokohama, Japan}
\affiliation{Wroc{\l}aw University of Technology}
\address{Department of Mathematics and Computer Science\\
\indent Wrocław University of Technology, Wybrzeże Wyspiańskiego 27, Wrocław 50-370}
\email{Nowy.Author@pwr.wroc.pl}
\city{Wroc{\l}aw}
%%The second author
%\author[M. Bogdan]{M. Bogdan}
%\thanks{ble ble ble}
%\affiliation{Wroc{\l}aw University of Technology and Jan D{\l}ugosz University in Cz\c{e}stochowa}
%\address{Department of Mathematics and Computer Science, Wroc{\l}aw University of Technology, Wybrze\.ze Wyspia\'nskiego 27, Wroc{\l}aw 50-370\\
%Department of Mathematics and Computers Science, Jan D{\l}ugosz University in Cz\c{e}stochowa}
%\email{Malgorzata.Bogdan@pwr.wroc.pl}
%\city{Wroc?aw}
%\urladdr{www.im.pwr.wroc.pl\~mbogdan}
%\author[Third]{Third Author}
%\thanks{blu blu blu}
%\affiliation{Third University}
%\address{Third Address}
%\email{third@g.h.i}
%\urladdr{www.g.h.i/\php third}
\comm{Anna Marciniak-Czochra}
\subjclass[2010]{Primary: 62J05; Secondary: 92D20}
\keywords{statistical genetics, quantitative trait loci, model selection, sparse linear regression, Bayesian Information Criterion}
\begin{document}
\vspace{-5ex}
%\Poczatek
%\Chapter
%\pagenumbering{roman}
\setcounter{page}{193} %%This command starts the numerations of pages
\selectlanguage{english}\Polskifalse
%\selectlanguage{polish}\Polskitrue
\begin{abstract}
Advances in genetics in recent years have made it possible to examine DNA chains with high precision and to collect vast amounts of information. At the same time, the relationships between genes and traits have turned out to be more complex than previously thought. Because communication between mathematicians and geneticists is far from ideal, geneticists' knowledge of methods other than the classical ones is still limited.
\end{abstract}
%\tableofcontents
\section{DNA as the carrier of genetic information.}
The huge diversity of living organisms on our planet needs no demonstration. Nevertheless, every form of life shares a common molecule, DNA, which is built from nucleotides (each consisting of deoxyribose, a phosphate group and a nitrogenous base). On closer inspection, the exact composition of this molecule depends on the species; moreover, it serves as a kind of blueprint for how an organism is to be built. For this reason, we may be tempted to treat it as a measure of similarity between species. It is believed that chimpanzee DNA is 98\% identical to human DNA. Can we also find similarities between humans and something as different as yeast? It turns out that we share a quarter of our genes with it.
\subsection{Genes}
What are genes? There is no simple answer to this question, at least at the present stage of scientific development. The reason is that when the term was coined, not much was known about DNA. A gene was understood as a theoretical unit of inheritance, that is, something that significantly affects the phenotype (the set of features) of an individual and is passed down from generation to generation. Only later did researchers try to find a material object that would correspond to this abstract entity. Textbooks give the outcome of that search: a gene is a piece of DNA that determines the formation of one molecule of protein or RNA. In recent years, however, confidence in this picture has decreased. The gene seems to be something more complex, and so are its definitions. Some even suggest that the concept may be worth abandoning \cite{YosKunTak1998:Heart}.
In this paper, a gene is understood as a segment of DNA that has a meaning (it affects a trait more or less indirectly) and that occurs in at least two versions, called alleles. Depending on whether an individual carries the gene in version {\it A} or {\it a}, the result may be, for example, a higher or lower risk of developing a disease.
\subsection{Inter-individual differences in DNA}
From this point on we are interested in inter-individual differences in DNA. We focus on a single species and look for the places that make two carrots, or two people, differ from each other. Such differences are smaller than those between species: the DNA of two randomly chosen people is most likely 99.9\% identical. The remaining one per mille is nevertheless enough to produce many differences between us (it is worth noting that the environment also affects our features, and the relative contributions are actually not known).
At this point we have to distinguish between finding genes in humans and in other species. To do so, let us take a closer look at the structure of DNA. What interests us most are the nitrogenous bases, which usually come in four versions: adenine, cytosine, guanine and thymine. Two DNA chains differ when different nitrogenous bases occur at the same location. In animals and plants we generally look for longer segments of DNA that can occur in different versions, whereas in humans we most often consider individual nucleotides, namely those at which at least one percent of people differ from the rest; these are called single nucleotide polymorphisms (SNPs). Figure~\ref{r1} shows schematically what a gene and a SNP usually look like.
\begin{figure}
\begin{center}
\includegraphics[width=0.48\textwidth]{Pictures/gen_snp2.pdf}
\end{center}
\vspace{-20pt}
\caption{Gene and SNP}
\label{r1}
\end{figure}
\subsection{Why look for genes?}
We close this section by asking how knowing which places in the DNA are responsible for which traits can be useful. In humans, it lets us better understand the cause of a disease and thus develop a more effective drug. It also allows us to assess risk much more quickly and to start treatment earlier. In animals such as cows, once we discover which genes are responsible for, say, milk production, we can breed only the appropriate individuals. Information about the location of a gene is also useful in fruit cultivation: if we want to grow only sweet fruit in our orchard, then instead of crossing different varieties for decades in search of the optimal characteristics, we can immediately select those with the appropriate parameters \cite{Sad}.
\section{General model.} We would now like to move on to mathematics and translate the information about the genotypes of an individual into numbers. We have denoted the alleles by {\it A} and {\it a}, which may seem unreasonable: what symbol should be chosen for a third allele? It turns out that the occurrence of a third or further version is so unlikely that this possibility is usually ignored. The reason is that a mutation in DNA is rare, so a second mutation at the same place hardly ever occurs. We could therefore encode the genetic information with only two numbers, were it not for the fact that DNA is organized in chromosomes, which occur in pairs. Corresponding chromosomes do not carry identical sequences, since one is inherited from the mother and the other from the father. Thus, at a given place in the DNA there are three possibilities: \textit{AA}, \textit{aA} (or \textit{Aa}; the order is not important), or \textit{aa}.
In summary, for each individual we have a sequence of genotypes (e.g. encoded as $-1$, $0$ and $1$) and the value of the trait of interest. The genotypes serve as qualitative explanatory variables and the trait as the dependent variable.
\section{Tests in single markers}
Our task is to identify which genes significantly influence the trait under consideration. Note that we focus on locating these genes; the exact form of the dependence is of secondary interest. Let us begin by approaching the problem in the simplest possible way.
If the trait is quantitative, we can use a suitable test to verify the null hypothesis that the mean value of the trait does not depend on the genotype at the marker. When the distribution of the trait does not differ significantly from normal, we often use the classical Student's t-test (if only two genotypes are considered) or the F test for analysis of variance. If the distribution of the trait is not normal, we can apply an appropriate transformation or replace the trait values with their ranks.
\subsection{Linear regression}
It is common practice to test the significance of a given marker with a linear regression model. We fit the model
$$Y_i=\beta_0 + \beta_j X_{ij} + \varepsilon_i\;,\;\;i=1,\ldots,n,$$
where $\varepsilon_i$ is a normally distributed random variable with mean 0 and variance $\sigma^2$, and $X_{ij}$ is the genotype of the $j$-th marker for the $i$-th individual. When the genotype takes only two values, for example {\it aa} and {\it AA}, the following encoding is commonly used:
$$X_{ij}=\left\{\begin{array}{ccc}
-1/2,&&aa\\
1/2,&&AA
\end{array}
\right.$$
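A problem arises when we consider three genotypes, because a single numerical variable then imposes a fixed relationship between the three group means (with the encoding of $X_{ij}$ given below, the heterozygote would have to lie exactly halfway between the two homozygotes). It is therefore best to introduce an additional variable that removes this restriction. The following encoding is used most often: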
$$X_{ij}=\left\{\begin{array}{ccc}
1,&&aa\\
0,&&aA\\
-1,&&AA
\end{array}
\right.$$
and
$$Z_{ij}=\left\{\begin{array}{ccc}
-1/2,&&aa\;\; \mbox{or}\;\; AA\\
1/2,&&aA
\end{array}
\right.$$
More on encoding can be found in \cite{YosKunTak1998:Heart}. The model under consideration now takes the form
$$Y_i=\beta_0 + \beta_j X_{ij} + \gamma_j Z_{ij}+ \varepsilon_i\;.$$
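To see what the two coefficients measure under this encoding, it may help to write out the expected trait value for each genotype. The notation $\mu_{aa}$, $\mu_{aA}$, $\mu_{AA}$ for these group means is introduced here only for illustration:
$$\begin{array}{lcl}
\mu_{aa}&=&\beta_0+\beta_j-\gamma_j/2,\\
\mu_{aA}&=&\beta_0+\gamma_j/2,\\
\mu_{AA}&=&\beta_0-\beta_j-\gamma_j/2.
\end{array}$$
Solving for the coefficients gives
$$\beta_j=\frac{\mu_{aa}-\mu_{AA}}{2},\qquad \gamma_j=\mu_{aA}-\frac{\mu_{aa}+\mu_{AA}}{2},$$
so $\beta_j$ is half the difference between the two homozygotes and $\gamma_j$ is the deviation of the heterozygote from their midpoint (in genetics these are usually called the additive and the dominance effect). With $\gamma_j=0$ the heterozygote lies exactly halfway between the homozygotes, which is the restriction that a model with $X_{ij}$ alone would impose.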
In terms of the regression models, the null hypothesis stated earlier becomes $\beta_j=0$ or, for three genotypes, $\beta_j=\gamma_j=0$. In both cases we can verify it with the F-Snedecor test, which examines the ratio of the sum of squares explained by the model to the residual sum of squares (each divided by its number of degrees of freedom), or with the likelihood ratio test. When the model contains only $X_{ij}$, we can also use the Student's t-test, in which the estimator $\hat{\beta}_j$ is divided by its estimated standard deviation. We do not go into detail about these tests, because they are the classical approach to studying the significance of regression coefficients. It can also be shown that, for the models considered here, the F-Snedecor test is equivalent to the F test for analysis of variance (and that the Student's t-test for the model with two genotypes is equivalent to the F-Snedecor test).
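As a sketch of why the t-test and the F test agree in the model with $X_{ij}$ alone, the statistics can be written out explicitly. The symbols $\bar{X}_j$, $\bar{Y}$, $S_{xx}$, $RSS$, $\hat{\beta}_0$ and $\hat{\sigma}^2$ are standard least-squares notation introduced here, not taken from the text above:
$$\hat{\beta}_j=\frac{\sum_{i=1}^n (X_{ij}-\bar{X}_j)(Y_i-\bar{Y})}{S_{xx}},\qquad S_{xx}=\sum_{i=1}^n (X_{ij}-\bar{X}_j)^2,\qquad \hat{\sigma}^2=\frac{RSS}{n-2},$$
where $RSS=\sum_{i=1}^n (Y_i-\hat{\beta}_0-\hat{\beta}_j X_{ij})^2$ is the residual sum of squares. The two test statistics are then
$$t=\frac{\hat{\beta}_j}{\hat{\sigma}/\sqrt{S_{xx}}},\qquad F=\frac{\hat{\beta}_j^2 S_{xx}/1}{RSS/(n-2)}=t^2,$$
because the sum of squares explained by the model equals $\hat{\beta}_j^2 S_{xx}$. Under the null hypothesis $\beta_j=0$, $t$ has the Student distribution with $n-2$ degrees of freedom and $F$ has the $F_{1,n-2}$ distribution, so the two-sided t-test and the F test reject for exactly the same data. For the model with both $X_{ij}$ and $Z_{ij}$ the same construction gives an $F$ statistic with $2$ and $n-3$ degrees of freedom.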
\subsection{The problem with multiple testing}
When we test individual markers, whether with classical tests or with the linear regression approach, we face the problem of multiple testing: a single test carried out at the significance level $\alpha$ keeps the probability of a false rejection at $\alpha$, but this guarantee does not extend to a whole family of tests. For example, if we have a thousand markers and test each at the level of 0.05, we can expect about 50 false discoveries (assuming that hardly any of the markers affects the trait). This is not acceptable, and therefore we apply corrections for multiple testing that control the probability of making at least one type I error, the familywise error rate (FWER). The simplest is the Bonferroni correction, in which each test is performed at the level of $\alpha/m$, where $m$ is the number of markers. We then have the guarantee that FWER will not exceed $\alpha$. This correction becomes problematic, however, when the genotypes of the markers are strongly correlated, which is typical in experimental populations. The level of $\alpha/m$ is then too low, and an important gene may escape our attention. One solution is to use permutation tests \cite{YosKunTak1998:Heart}, which adjust the critical value of the test to the correlation structure between the markers (in fact, between the values of the test statistics). The procedure is as follows: we permute the vector $Y$ many times, for each permutation we compute the values of the test statistics for all markers, and we record their maximum. As the critical value we take the $1-\alpha$ quantile of the distribution of the resulting maxima.
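The numbers in this paragraph can be checked directly. The following is only an illustration; the symbols $m_0$ (the number of markers with no effect on the trait), $V$ (the number of false rejections), $H_j$ and $p_j$ (the null hypothesis and the p-value for marker $j$), $T_j$, $M_b$ and $B$ are notation introduced here. With $m=1000$ markers tested at the level $\alpha=0.05$,
$$E(V)=m_0\alpha\leq m\alpha=1000\cdot 0.05=50,$$
and if in addition the tests are independent and no marker affects the trait,
$$P(V\geq 1)=1-(1-\alpha)^m=1-0.95^{1000}\approx 1.$$
With the Bonferroni correction each test is carried out at the level $\alpha/m=0.05/1000=5\cdot 10^{-5}$, and the bound
$$FWER=P(V\geq 1)\leq\sum_{j:\,H_j\ \mathrm{true}}P\left(p_j\leq\frac{\alpha}{m}\right)\leq m_0\,\frac{\alpha}{m}\leq\alpha$$
holds whatever the dependence between the markers. For the permutation test, if $T_j(Y)$ denotes the test statistic for marker $j$ (with large values speaking against the null hypothesis) and $Y^{(1)},\ldots,Y^{(B)}$ are $B$ random permutations of $Y$, the critical value is the $1-\alpha$ empirical quantile of
$$M_b=\max_{1\leq j\leq m}T_j\left(Y^{(b)}\right),\qquad b=1,\ldots,B.$$
A marker is then declared significant when its statistic $T_j(Y)$ exceeds this critical value.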
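A less conservative alternative is the Benjamini-Hochberg procedure. We order the p-values of the $m$ tests, $p_{[1]}\leq p_{[2]}\leq\ldots\leq p_{[m]}$, and find
$$k_F=\max\left\{i:\; p_{[i]}\leq \frac{i\alpha}{m}\right\}.$$
Then we reject the $k_F$ hypotheses with p-values less than or equal to $p_{[k_F]}$. The procedure may seem strange, but it has been shown to control, at a level not exceeding $\alpha$, the so-called false discovery rate (FDR), i.e.
$$FDR=E\left(\frac{V}{R}\,\Big|\,R>0\right)P(R>0),$$
where $R$ is the number of all rejected hypotheses and $V$ is the number of false rejections.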
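A small numerical illustration may help. It assumes that $p_{[1]}\leq\ldots\leq p_{[m]}$ are the ordered p-values and that $k_F$ is the largest index $i$ with $p_{[i]}\leq i\alpha/m$ (the Benjamini-Hochberg step-up rule); the five p-values below are made up for the example. For $m=5$ and $\alpha=0.05$:
$$\begin{array}{c|ccccc}
i&1&2&3&4&5\\ \hline
p_{[i]}&0.001&0.012&0.025&0.041&0.200\\
i\alpha/m&0.010&0.020&0.030&0.040&0.050
\end{array}$$
The inequality $p_{[i]}\leq i\alpha/m$ holds for $i=1,2,3$ and fails for $i=4,5$, so $k_F=3$ and the three hypotheses with the smallest p-values are rejected. The Bonferroni correction, which compares every p-value with $\alpha/m=0.01$, would reject only the first one.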
\section{Multiple regression.} The main problem with testing single markers is that we completely ignore the influence of the other markers. If several genes are associated with a trait (which is usually the case), it is better to fit a model that contains all of these relevant genes. In addition, genes may interact with each other. All of this can be modeled with multiple regression. If we consider only second-order interactions, the model for the case of two genotypes takes the form
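$$Y_i=\beta_0+\sum_{j=1}^m \beta_j X_{ij} + \sum_{1\leq j<k\leq m} \gamma_{jk} X_{ij}X_{ik}+\varepsilon_i\;,$$
where $\gamma_{jk}$ describes the interaction between markers $j$ and $k$.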
%
\newpage
\begin{center}
{\bf Gene localization.}\\ %The optimal time for a delegation to specialized contractors -- outsourcing --outside-resource-using
%\href{http://wydawnictwa.ptm.org.pl/index.php/matematyka-stosowana/article/view/289/282}{Jan Poleszczuk}
\href{\repo/597}{Piotr Szulc}
\end{center}
\medskip
\begin{abstract}
Advances in genetics in recent years have made it possible to examine DNA chains with high precision and to collect enormous amounts of information. In addition, the relationships between genes and traits have turned out to be more complicated than previously thought. These two developments have made close cooperation essential between geneticists and mathematicians, whose task is to develop special methods that cope with specific and difficult genetic problems. The article reviews both classical and the most recent approaches to the problem of gene localization, that is, to identifying the places in the DNA chain that significantly affect the traits of interest. Because communication between mathematicians and geneticists is not the best, knowledge of non-classical methods among the latter group remains limited.
\end{abstract}
\Koniec
\end{document}