# Statistics Project

I have long believed that there is a strong correlation between an individual's age and his or her contribution to household income. I decided to conduct a survey on the issue and have recorded my findings in this article. The overall aim of the project was to evaluate the correlation between age and household income using recorded statistics.

## Methodology

Data for the analysis were collected through structured interviews. The participants were randomly selected from the neighborhood to reduce the chance of bias in the sampling method. This means that every individual in the neighborhood had an equal probability of being considered for the analysis. The participants included my immediate neighbors and friends of my parents. All participants were visited at their residence for the interview, and telephone interviews were conducted where a personal visit was not feasible. The ethical issue in this research was maintaining the confidentiality of each individual's job function. This was respected, and no individual was asked about their job profile.

## Statistical Analysis

Descriptive statistics were used to summarize income across the different age groups. This included the mean, median, and mode of income for each age group. A regression equation was fitted to understand how the magnitude of household income changes with a person's age. The regression was carried out only after the correlation coefficient had been evaluated. For this project, a correlation coefficient was considered significant if its value was more than 0.5.
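As a rough illustration of the per-group summaries described above, the sketch below groups (age, income) pairs into age bins and reports the mean, median, and mode of income for each bin. The bin edges, the names `age_group` and `income_by_age_group`, and the sample records are all assumptions made for this example; they are not the survey data or the software output.

```python
from collections import defaultdict
from statistics import mean, median, multimode

# Hypothetical bins as (lower bound, upper bound exclusive, label).
# Only "under 35", "35-44" and "55-64" are named in the text; the rest are assumed.
AGE_BINS = [(0, 35, "under 35"), (35, 45, "35-44"), (45, 55, "45-54"), (55, 65, "55-64")]

def age_group(age):
    """Return the label of the bin containing `age`, or "65+" above the last bin."""
    for low, high, label in AGE_BINS:
        if low <= age < high:
            return label
    return "65+"

def income_by_age_group(records):
    """Summarise income per age group from an iterable of (age, income) pairs."""
    grouped = defaultdict(list)
    for age, income in records:
        grouped[age_group(age)].append(income)
    return {
        label: {
            "mean": mean(incomes),
            "median": median(incomes),
            "modes": multimode(incomes),  # all most-frequent values
        }
        for label, incomes in grouped.items()
    }

# Made-up records for illustration only.
sample = [(24, 30000), (29, 30000), (31, 42000), (38, 80000), (43, 95000), (58, 70000)]
summary = income_by_age_group(sample)
print(summary["under 35"])
```

Reporting the mode as a list avoids having to pick one value when several incomes are equally frequent, which is common in a small sample.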

## Results and Analysis

The results were plotted in relation to the survey questions as given below.

Which age group had the highest household income?

What is the median/average income for an individual age group considered for the study?

The survey questions were answered from the historical data chart, as shown in the figure below:

In the historical data chart, the 55-64 age group had a higher income than the other groups. In the interview findings, the lowest household income was in the under-35 age group, and the highest was in the 35-44 age group. These findings complemented the historical chart, but they also called into question the assumption that the 55-64 age group necessarily has a higher income than the younger age groups. However, the survey does indicate that age group may have some relationship with household income.

The historical data were supplemented with the collected interview data, as shown below, in order to fit the regression equation and compare income across age groups.

Modified box plots and Histograms: Descriptive Statistics of Collected data

Age Chart

StatCrunch software was used to obtain the statistical results. Quartiles, mean, median, and mode were analyzed. The data above indicated that income was lowest in the age group of 24.5 years and highest in the age group of 43.5 years.

The histograms of household income and age were both skewed to the right, and both distributions were unimodal. No individual in the sample had a household income in the range of $150,000 to $175,000. In addition, household income had two outliers.
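Modified box plots usually flag a point as an outlier when it lies more than 1.5 times the IQR beyond the first or third quartile. The sketch below applies that convention to made-up incomes. Whether the software used here applies exactly this rule and the same quartile method is an assumption, and the names `iqr_fences` and `find_outliers` are illustrative.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) outlier fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartile method may differ between packages
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Return the values that fall outside the fences, in their original order."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]

# Made-up, right-skewed household incomes (not the survey data).
incomes = [30000, 42000, 48000, 55000, 61000, 67000, 72000, 80000, 230000, 250000]
print(iqr_fences(incomes))
print(find_outliers(incomes))
```

Because different packages interpolate quartiles slightly differently, a point close to a fence may be flagged by one tool and not by another.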

As the summary statistics chart shows, the mean household income was $72,725 per annum at the mean age of 36.9 years. However, the IQR describes the spread better than the standard deviation, because the standard deviation was high relative to the mean. Likewise, the median describes the center best, since the histograms of both household income and age were skewed.
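To see why the median and IQR are preferred for skewed data, the following sketch compares the summary measures of a small made-up sample before and after one very large income is added. The numbers and the name `spread_summary` are hypothetical and serve only to show the effect.

```python
from statistics import mean, median, quantiles, stdev

def spread_summary(values):
    """Return the mean, median, sample standard deviation and IQR of `values`."""
    q1, _, q3 = quantiles(values, n=4)
    return {
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
        "iqr": q3 - q1,
    }

# Made-up incomes; the second list adds a single extreme value.
base = [40000, 52000, 60000, 68000, 75000, 82000, 90000]
with_extreme = base + [400000]

before = spread_summary(base)
after = spread_summary(with_extreme)
for key in before:
    print(key, round(before[key]), "->", round(after[key]))
```

In this example the single extreme value moves the mean and standard deviation far more than it moves the median and IQR, which is the behaviour the paragraph above relies on.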

Linear Regression and Correlation: Scatterplot Analysis

For the scatter plot, age was the X-variable and household income was the Y-variable.

The form was linear, the strength was moderate, and the direction was positive. There were two outliers, at age 40 and age 60. The linear regression equation was fitted and evaluated, as the outliers were equidistant from the regression line.
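A least-squares line of the kind fitted to a scatter plot can be computed directly from the data. The sketch below is a minimal version under the assumption of simple linear regression with one predictor; the function name `fit_line` and the sample points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # the line passes through (x_bar, y_bar)
    return slope, intercept

# Made-up (age, income) points, not the survey data.
ages = [25, 30, 35, 40, 45, 50]
incomes = [40000, 52000, 61000, 75000, 78000, 95000]
slope, intercept = fit_line(ages, incomes)
print(round(slope, 2), round(intercept, 2))
```

A positive slope corresponds to the positive direction described above: higher ages go with higher household incomes on average.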

Correlation Coefficient and Linear Regression:

The correlation coefficient between age and household income was r = 0.548. The linear regression equation was fitted with household income as the dependent variable and age as the independent variable. A scattergram was also plotted after correcting for outliers. The equation is shown below:

Household Income = -730.96886 + 1990.68 × Age

Fig: Scattergram with regression equation after correcting for outliers. The equation was significant at p < 0.05.
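The regression equation can be used as a simple prediction function. The sketch below plugs in the coefficients reported above and checks them against the reported mean age of 36.9 years; the function name `predict_income` is only illustrative.

```python
# Coefficients and correlation coefficient as reported in this article.
INTERCEPT = -730.96886
SLOPE = 1990.68
R = 0.548

def predict_income(age):
    """Predicted annual household income (dollars) for a given age in years."""
    return INTERCEPT + SLOPE * age

# At the mean age the prediction is close to the reported mean income of $72,725,
# as expected, because a least-squares line passes through the point of means.
print(round(predict_income(36.9), 2))

# r squared: the share of the variation in income that the fitted line accounts for.
print(round(R ** 2, 3))
```

The slope means that each additional year of age is associated with roughly $1,990 more in predicted annual household income. With r = 0.548, r squared is about 0.30, so the line accounts for roughly 30% of the variation in household income in the sample.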

## Discussion & Conclusion

The correlation coefficient and the linear regression support my hypothesis that age is correlated with household income. The correlation coefficient (r = 0.548) was greater than 0.5, which indicates a positive correlation of moderate strength between age and household income. Thus, as an individual's age increases, his or her contribution to household income also tends to increase. Moreover, the regression equation was significant at p < 0.05, which means that if age and household income were unrelated, a relationship at least this strong would be expected in fewer than 5 out of 100 samples (Rencher & Christensen 12). Household income can therefore be estimated from the regression equation, although a significant result does not make the prediction certain for any one individual.

The outliers in the data occurred at both the higher and the lower end of the sample, so their effects largely offset one another in the regression, and the regression equation remained statistically significant (Rencher & Christensen 12).

The sampling was randomized. As stated, the participants were randomly selected from the neighborhood to reduce the chance of bias, so that every individual in the neighborhood had an equal probability of being considered, and they included my immediate neighbors and friends of my parents. However, the distribution was tailed, and Markov’s correction could have been employed to arrive at a normal distribution. To reduce bias further, structured data collection could be employed through purposive sampling followed by stratified random sampling. Apart from age, the nature of a person's job also influences household income; however, because of ethical issues I refrained from undertaking such an analysis (Rencher & Christensen 12).
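One way to see what a p-value below 0.05 measures is a permutation test: shuffle the incomes so that any link with age is broken, and count how often the shuffled data show a correlation at least as strong as the observed one. This is a substitute technique chosen for illustration, not necessarily the test the statistical software reported, and the data and function names below are made up.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Estimate the two-sided p-value for the correlation by shuffling ys."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    at_least_as_strong = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            at_least_as_strong += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_strong + 1) / (n_permutations + 1)

# Made-up (age, income) data, not the survey data.
ages = [22, 25, 31, 36, 40, 44, 49, 53, 58, 62]
incomes = [30000, 41000, 52000, 60000, 95000, 88000, 79000, 102000, 97000, 120000]
p = permutation_p_value(ages, incomes)
print(p, p < 0.05)
```

A small p-value says that a correlation this strong would rarely arise by chance if age and income were unrelated; it does not say how many individual observations follow the relationship.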

## Technological Considerations

Histograms and box plots were plotted to estimate the descriptive statistics and the frequency distribution of the sample collected for the analysis. The histograms allow one age group to be compared with another, while the box plots help to identify the range of the data and the probable outliers that may add bias to the results. The scatter plot and the regression equation were produced in R. Although the dependent and independent variables selected were appropriate for the study, a more conclusive result would have required an additional independent variable, such as a qualitative (dummy) variable for the nature of the job.

## References

Rencher, A., & Christensen, W. (2012). “Chapter 10, Multivariate Regression, Section 10.1, Introduction.” Methods of Multivariate Analysis. Wiley Series in Probability and Statistics, 709 (3rd ed.). John Wiley & Sons, p. 19. Print.
