Application of Statistics in Biological and Clinical Sciences
Statistics is the science of collecting, representing and analyzing data. The science of statistics is applied to various branches of academics, including econometrics, economics, mathematics, the physical sciences and the life sciences. When such data deal with biological or clinical variables, the specialty is referred to as biostatistics or clinical statistics (Altman 167). This field of science helps to understand the prevalence, cause-effect relationships and significance of an experimental intervention. The endpoints of statistical calculations indicate the reliability, validity and reproducibility of the same observation under different experimental or ambient conditions (Cumming 27-28).
Biostatistics is very important for understanding the efficacy or tolerability profile of drugs and the incidence of a specific disease under certain conditions, and it also helps to compare the effects of one variable with another. The present article elucidates the applications of statistical methods in the fields of biology and medical science. The discourse covers the principles of collecting data, representing the collected data and relevant analysis of such data, taking a case-study-oriented approach to presenting the material. This will help in practical understanding and implementation of the subject in day-to-day practice (Health 123-154).
In the field of biology or in clinical settings, individuals are often unable to understand the relevance or significance of data. The issues are related to inappropriate selection of samples for conducting an experimental analysis. Further, there can be inappropriate selection of statistical tests of significance, which either underestimate or overestimate the collected data. Both such deviations are undesirable, since clinical or biological endpoints are linked to the well-being of patients. Fabricated data or insufficient analysis of data may be detrimental. Such data may jeopardize the endpoints of experimentation or may lead to selection of an ineffective treatment modality. Hence, knowledge of statistics in the field of biology or medical science is highly essential (Redmond & Colton 32-36).
Consider a case study: a renowned pharmaceutical company wants to conduct research regarding the efficacy of an antihypertensive molecule it has recently introduced in the market. The drug “X” belongs to the group of beta adrenergic receptor blockers. Initially, the company wants to conduct research on male hypertensive patients without any co-morbid disease conditions.
Collection of Data: Importance of Sampling
Data for experimentation should be collected so as to represent the entire population assumed for the experimentation. This is because the entire population cannot be included in the research; hence, a group of individuals who represent the demographics of the population is selected for the study. This group of individuals is referred to as a sample. Thus, biostatistics deals with samples and not the entire population. Extreme care should be taken in selecting the appropriate sample for a specific study. If appropriate sampling methods are not implemented, bias may be introduced into the experimental analysis and the interpretation of results. Thus, sampling should be conducted so that the experimental results in a sample can be reproduced and validated in similar individuals who belong to the population but were not included in the study (Health 123-154).
Various forms of sampling techniques are used in statistical analysis. The common techniques include stratified random sampling, purposive sampling and judgmental sampling. The principle of sampling specifies that a sample should project a specified population in exact proportion of gender, age groups, disease states, socio-economic profile and educational background, as per the criteria of an experiment. Proper sampling helps to eliminate chances of bias in an experiment (Health 123-154).
Hence, in the case study above, purposive sampling will be deployed to select only male volunteers who are either newly diagnosed with hypertension or whose blood pressure has not reached the target value even with other antihypertensive medications. Individuals who are suffering from diabetes, dyslipidemia or left ventricular failure will not be included in the study. However, the sample should match the target population in terms of economic and social profile. Based on the initial purposive sampling, stratified sampling may then be used to calculate the number of individuals to include from each socio-economic group, based on its proportion in the target population.
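As a minimal sketch, proportional allocation under stratified sampling can be computed directly; the socio-economic strata and their population proportions below are hypothetical, since the article does not specify them:

```python
# Proportional allocation for stratified sampling: each stratum
# contributes to the sample in proportion to its share of the
# target population. Strata and proportions are hypothetical.
def allocate(strata_proportions, sample_size):
    return {stratum: round(p * sample_size)
            for stratum, p in strata_proportions.items()}

# Hypothetical socio-economic strata for a sample of 15 volunteers.
strata = {"low income": 0.40, "middle income": 0.40, "high income": 0.20}
print(allocate(strata, 15))
# {'low income': 6, 'middle income': 6, 'high income': 3}
```

In practice, rounding can make the allocated counts sum to slightly more or less than the intended sample size, so the largest stratum is often adjusted to absorb the difference.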
Sampling must also ensure that the data collected follow a normal distribution. A normal distribution indicates that data collection is uniform, meaning the data represent all sets of values of the variable to be analyzed in the study. For the case study taken in this article, a normal distribution indicates that the sample should contain blood pressure values at the higher, middle and lower ends of the range. If only individuals with the highest and lowest blood pressures are included, the distribution will be leptokurtic (peaked). On the other hand, if only the middle range of blood pressure values is given priority, the distribution becomes platykurtic (flat). Under both circumstances, the results will suffer from experimental bias, jeopardizing the aim of the study. Hence, samples should follow a normal distribution, which means the distribution should be mesokurtic (Health 123-154).
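One way to gauge whether a sample leans leptokurtic, platykurtic or mesokurtic is moment-based excess kurtosis, which is approximately zero for normally distributed data; a sketch:

```python
# Excess kurtosis: fourth central moment over squared variance, minus 3.
# Positive values suggest a leptokurtic (peaked) sample, negative values
# a platykurtic (flat) one, and values near zero a mesokurtic sample.
def excess_kurtosis(data):
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / (m2 ** 2) - 3

print(excess_kurtosis([1, 2, 3, 4, 5]))  # negative: flat, uniform-like data
```

This is only a rough screen for small samples; formal normality tests would be needed for a firm conclusion.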
For the current case study, data were collected from 15 male individuals belonging to African-American ethnicity. Only systolic blood pressure is considered for the analysis. The collected data are represented below:
Individual    Systolic Blood Pressure (mm Hg)    Age (years)
1             150                                45
2             170                                48
3             210                                49
4             200                                50
5             170                                45
6             150                                50
7             160                                45
8             200                                45
9             170                                48
10            150                                48
11            160                                45
12            200                                50
13            170                                45
14            150                                45
15            160                                45
The above data show that all individuals were aged between 45 and 50 years (for effective standardization and to control for age-related confounding variables). Confounding variables are those which can influence the variables under study. For example, blood pressure may be affected by age, and the applied drug may also be metabolized more or less rapidly with age, so the effective concentration will not be the same for each individual. Controlling for confounding variables reduces the chances of bias in the results and increases the power and validity of the results. The collected data may be considered appropriate since all grades of hypertensive patients are included in the study. Therefore, the collected sample has a high probability of adhering to a normal distribution.
Representation of Data: A Quick Glimpse for Comparison
The collected data should be analyzed with the help of statistical tests or from visual depictions. Representation of data provides the backbone for generating statistical tests of significance and also for preliminary analysis. Data can be represented through various methods, including frequency polygons, frequency tables, pie diagrams, bar diagrams and histograms. The basic philosophy of representing data is to indicate the number of cases belonging to each class. Representation can also reflect a particular statistical parameter (a measure of central location) in different classes considered for the study, and it helps in comparison between two sets of observations or cases (Health 123-154).
For example, the collected data reflect that 8 individuals have systolic blood pressure of 170 mm Hg or above, while 7 individuals have blood pressure below 170 mm Hg. This indicates that 53% have blood pressure of 170 mm Hg and above, while 47% have blood pressure of 169 mm Hg and below. This is easily portrayed in a pie diagram, which reduces the need to search through the raw data tables and thus saves time.
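These proportions can be recomputed directly from the raw readings, as a quick sketch:

```python
# Systolic blood pressure readings for the 15 subjects in the case study.
pressures = [150, 170, 210, 200, 170, 150, 160, 200,
             170, 150, 160, 200, 170, 150, 160]

high = sum(1 for bp in pressures if bp >= 170)   # 170 mm Hg and above
low = len(pressures) - high                      # 169 mm Hg and below
print(high, low)                                 # 8 7
print(round(100 * high / len(pressures)))        # 53 (per cent)
```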
Analysis of Data
Statistics of Location
For analysis of data and implementing statistical tests of significance, it is important to specify the descriptive statistics. Descriptive statistics comprise the measures of central tendency and the measures of dispersion. Measures of central tendency are statistical parameters that summarize the typical value of a variable in a sample. Measures of dispersion, on the other hand, indicate how values fluctuate within a sample and from sample to sample in the same population, or from sample to sample in different populations. Measures of central tendency include the mean, median and mode. The mean represents the average value of a variable in a sample. The median represents that value of a variable in a sample above and below which 50% of cases fall. The mode represents the most frequent value of a variable within a given distribution (Health 123-154).
In the example considered as the case study, the average (mean) of the blood pressure readings is 171.33 mm Hg. The median value is 170 mm Hg because exactly 50% of cases fall above and below this value (7 above and 7 below). The distribution is bimodal: both 150 mm Hg and 170 mm Hg are modes, because each of these blood pressures was exhibited by four subjects, more than any other value in the sample.
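These central tendencies can be verified with Python's standard statistics module (note that multimode reports every most-frequent value, which exposes the bimodal character of this sample):

```python
import statistics

# Systolic blood pressure readings for the 15 subjects in the case study.
pressures = [150, 170, 210, 200, 170, 150, 160, 200,
             170, 150, 160, 200, 170, 150, 160]

print(round(statistics.mean(pressures), 2))  # 171.33
print(statistics.median(pressures))          # 170
print(statistics.multimode(pressures))       # [150, 170] - each occurs 4 times
```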
Statistics of Dispersion
The statistics of dispersion indicate the extent to which raw scores vary from the central tendencies in a sample. Moreover, they can also indicate the dispersion of a central tendency in a sample with respect to the population central tendency. The common measures of dispersion deployed in statistical calculations are the standard deviation, the standard error and the variance. The standard deviation is the root of the mean squared deviations of raw scores from the sample mean; its value is expressed as ±. The standard error, on the other hand, reflects the deviation of the sample mean from the population mean. The standard error of the difference between means is another statistic of dispersion, which indicates the variability of the difference between the means of two samples with respect to the difference between the parametric means of the two populations (Health 123-154).
The variance is the square of the standard deviation. Theoretically, it indicates whether the mean of a sample is subject to change with the application of treatment variables. If the mean of a sample changes after a treatment variable has been applied, there is considered to be an added variance component, and the inference is that the treatment variable has created a change or impact on the mean of the sample. On the other hand, if the mean of a sample does not change with the application of treatment variables, there are considered to be no added variance components, and the inference is that the treatment variable has not created any change or impact on the mean of the sample. Such inferences are the guiding principle for conducting the ANOVA (analysis of variance) test (Health 123-154).
With regard to the case study considered in the article, the statistics of dispersion may be represented as below:
Statistic of dispersion         Value
Standard deviation              21.00
Standard error of the mean      5.42
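The dispersion statistics can likewise be computed from the raw readings; the sketch below uses the sample standard deviation (with n − 1 in the denominator) and derives the standard error of the mean from it:

```python
import math
import statistics

# Systolic blood pressure readings for the 15 subjects in the case study.
pressures = [150, 170, 210, 200, 170, 150, 160, 200,
             170, 150, 160, 200, 170, 150, 160]

sd = statistics.stdev(pressures)                 # sample standard deviation
se = sd / math.sqrt(len(pressures))              # standard error of the mean
print(round(sd, 2), round(se, 2))                # 21.0 5.42
print(round(statistics.variance(pressures), 2))  # 440.95 (sd squared)
```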
Inferential Statistics
Inferential statistics indicate the meaning of a particular study and the significance of the observed results. For the purpose of inferential statistics, various kinds of statistical tests of significance are conducted. The common tests of significance include Student's t test, the chi-square test and the ANOVA test. Different tests of significance are chosen under different perspectives (Health 123-154). If the variable under study is a continuous measurement variable (measured up to decimal places) and the number of subjects in a sample is at least more than six, the t test can be the best option to compare two data sets. On the other hand, when the data are qualitative or a discontinuous measurement variable (cannot be measured up to decimal places), the chi-square test is the best option to compare sets of data. Moreover, chi-square tests may also be conducted when the sample size is too low for other statistical tests of significance. When more than two groups of a continuous measurement variable are to be compared, ANOVA may be conducted (Vaughan 146-152).
The parameter which is evaluated is either “t” or “chi-square”. t is calculated as the difference between the means of two sets of observations divided by the standard error of the difference between the means:
t = (Mean of observation 1 − Mean of observation 2) / Standard error of difference between means
The chi-square test, on the other hand, relies on the calculation of the parameter “chi-square”, which is obtained by summing, over all sets of observations, the squared difference between observed and expected frequencies divided by the expected frequency (Myers 124-154):
Chi-square = Σ [(observed frequency − expected frequency)² / expected frequency]
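That sum can be written directly as a short function; the observed and expected frequencies below are a toy illustration, not data from the case study:

```python
def chi_square(observed, expected):
    # Sum of (O - E)^2 / E over all categories.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Toy example: 45 responders observed where 50 were expected, and
# 55 non-responders observed where 50 were expected.
print(chi_square([45, 55], [50, 50]))  # 1.0
```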
The calculated t score or chi-square score is compared against the critical t or critical chi-square value to establish the level of significance (Health 123-154). Returning to the example taken for the case study, let the blood pressures after application of 5 mg of antihypertensive “X” be as follows:
Individual    Systolic Blood Pressure before drug (mm Hg)    Systolic Blood Pressure after drug (mm Hg)
1             150                                            148
2             170                                            140
3             210                                            210
4             200                                            120
5             170                                            140
6             150                                            130
7             160                                            120
8             200                                            120
9             170                                            120
10            150                                            130
11            160                                            120
12            200                                            150
13            170                                            140
14            150                                            140
15            160                                            170
There is clearly a difference between the means of the two samples: 171.33 mm Hg before treatment versus 139.87 mm Hg after treatment.
To support the statement that drug “X” has significantly reduced the systolic blood pressure, the “p” value needs to be analyzed (Craparo 889-891). Here, it is observed that drug “X” has significantly reduced the systolic blood pressure in the experimental subjects (p = 0.0004).
Before a statistical test of significance is conducted, it is important to frame the experimental hypothesis. The hypothesis considered in the case study was based on the rejection of the null hypothesis. The null hypothesis contends that there is no significant difference between the group means before and after application of a treatment variable (a dose of the drug). The null hypothesis is retained if the p value is > 0.05. A p value > 0.05 indicates that, out of 100 repetitions of the experiment, such a difference between the means of the pre-treatment and post-treatment groups would occur more than 5 times purely through the chance factors of random sampling. Such a likelihood of occurring by chance is considered too high, and it would be concluded that drug “X” is ineffective in reducing the blood pressure of the experimental subjects.
On the other hand, the alternative hypothesis contends that there is a significant difference between the group means before and after application of a treatment variable (a dose of the drug). The alternative hypothesis is retained if the p value is < 0.05. A p value < 0.05 indicates that, out of 100 repetitions of the experiment, such a difference between the means of the pre-treatment and post-treatment groups would occur fewer than 5 times through the chance factors of random sampling alone (Vaughan 146-152). Such a likelihood of occurring by chance is considered too low, and it would be concluded that drug “X” is effective in reducing the blood pressure of the experimental subjects.
Since, in our case study, the p value is well below 0.05 (it was 0.0004), the null hypothesis was rejected and the alternative hypothesis was accepted. The final conclusion was that drug “X” was effective in reducing the blood pressure of the experimental subjects.
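A sketch of the paired t computation behind this decision, using the two columns of readings: t is the mean of the paired differences divided by its standard error, with 14 degrees of freedom here.

```python
import math
import statistics

# Systolic blood pressures before and after drug "X" (mm Hg).
before = [150, 170, 210, 200, 170, 150, 160, 200,
          170, 150, 160, 200, 170, 150, 160]
after = [148, 140, 210, 120, 140, 130, 120, 120,
         120, 130, 120, 150, 140, 140, 170]

diffs = [b - a for b, a in zip(before, after)]
mean_diff = statistics.mean(diffs)
se_diff = statistics.stdev(diffs) / math.sqrt(len(diffs))
t = mean_diff / se_diff
print(round(mean_diff, 2), round(t, 2))  # 31.47 4.59
# With 14 degrees of freedom, this t corresponds to p of roughly
# 0.0004 (two-tailed), consistent with the article's reported value.
```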
The correlation coefficient is used to evaluate whether there is a relationship between two variables. It can be positive or negative. A positive correlation indicates that increasing the value of one variable will increase the value of the other variable, while a negative correlation indicates that increasing the value of one variable will decrease the value of the other variable (Health 123-154). In the case study, the correlation coefficient was estimated between age and pre-treatment systolic blood pressure and was found to be 0.41. This indicates that a positive correlation exists between systolic blood pressure and age.
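The Pearson correlation coefficient reported here can be recomputed from the age and pre-treatment blood pressure columns of the case-study table:

```python
import math

# Ages and pre-treatment systolic blood pressures of the 15 subjects.
ages = [45, 48, 49, 50, 45, 50, 45, 45, 48, 48, 45, 50, 45, 45, 45]
pressures = [150, 170, 210, 200, 170, 150, 160, 200,
             170, 150, 160, 200, 170, 150, 160]

def pearson_r(x, y):
    # Pearson's r: covariance over the product of standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(round(pearson_r(ages, pressures), 2))  # 0.41
```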
Although the correlation coefficient can indicate the direction of the relation between two variables, it cannot predict the value of one variable from the likely value of the other correlated variable (Health 123-154). Hence, regression equations are constructed to estimate the value of one variable from the likely value of the other correlated variable.
For example, in the case study considered, age was the independent variable and blood pressure the dependent variable. This was hypothesized because the value of age might help in predicting the blood pressure of an individual. Regression equations are therefore constructed between a predictor variable (independent variable) and a criterion variable (dependent variable) (McKillup 32-38). The regression equation framed as per the present case study was:
Blood pressure = −15.9432 + 3.99 × Age in years
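The least-squares coefficients of that equation can be recovered from the same sums used for the correlation: slope = Sxy / Sxx, intercept = mean(y) − slope × mean(x). (The article rounds the slope down to 3.99; to two decimal places it is 4.00.)

```python
# Ages and pre-treatment systolic blood pressures of the 15 subjects.
ages = [45, 48, 49, 50, 45, 50, 45, 45, 48, 48, 45, 50, 45, 45, 45]
pressures = [150, 170, 210, 200, 170, 150, 160, 200,
             170, 150, 160, 200, 170, 150, 160]

n = len(ages)
mx, my = sum(ages) / n, sum(pressures) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(ages, pressures))
sxx = sum((a - mx) ** 2 for a in ages)

slope = sxy / sxx              # change in blood pressure per year of age
intercept = my - slope * mx
print(round(slope, 2), round(intercept, 2))  # 4.0 -15.94
```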
Discussion and Conclusion
Statistics in biology is very important for establishing experimental results in journals and laboratory set-ups. Unless accurate data and analyses are reported, improper conclusions may be drawn, which can be detrimental, wasting the money spent on research and jeopardizing the clinical care of patients.
Works Cited
Altman, Douglas G. Practical Statistics for Medical Research. New York, USA:
Chapman & Hall/CRC, 1999: pp 167. Print.
Craparo, Robert M. “Significance Level.” In Salkind, Neil J. (ed.), Encyclopedia of
Measurement and Statistics, Vol. 3. Thousand Oaks, CA: SAGE Publications, 2007:
pp 889–891. Print.
Cumming, Geoff. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and
Meta-Analysis. New York, USA: Routledge, 2012: pp 27–28. Print.
Health, David. An Introduction to Experimental Design and Statistics for Biology (1st ed.).
Boston, MA: CRC Press, 1995: pp 123–154. Print.
McKillup, Steve. Statistics Explained: An Introductory Guide for Life Scientists (1st ed.).
Cambridge, UK: Cambridge University Press, 2006: pp 32–38. Print.
Myers, Jerome L., Arnold D. Well, and Robert F. Lorch Jr. “The t Distribution and Its
Applications.” Research Design and Statistical Analysis (3rd ed.). New York, NY:
Routledge, 2010: pp 124–154. Print.
Redmond, Carol, and Theodore Colton. “Clinical Significance versus Statistical
Significance.” Biostatistics in Clinical Trials. Wiley Reference Series in Biostatistics
(3rd ed.). West Sussex, United Kingdom: John Wiley & Sons Ltd, 2001: pp 35–36. Print.
Vaughan, Simon. Scientific Inference: Learning from Data (1st ed.). Cambridge, UK:
Cambridge University Press, 2013: pp 146–152. Print.