29th September 2023

We conducted a study using a dataset designed to investigate three critical factors affecting an individual's overall health: obesity, physical inactivity, and diabetes. The dataset contains 354 data points, each with measurements for all three variables.

To assess our data, we divided the 354 data points into five roughly equal folds; the choice of five is not fixed and can be adjusted as needed. Four of the folds contained 71 data points each and one contained 70. In each round, four folds were used to train the model while the remaining fold was held out to test its performance, and we repeated this process five times, each time holding out a different fold.
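As a rough illustration of this procedure, here is a minimal sketch using scikit-learn's KFold; the arrays below are stand-ins for the real 354-row dataset of %obesity, %inactivity, and %diabetes measurements, and the model is assumed to be ordinary least squares.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Stand-in for the real dataset: 354 rows, two predictors (%inactivity, %obesity)
# and one response (%diabetes). Replace with the actual project data.
rng = np.random.default_rng(0)
X = rng.random((354, 2))
y = rng.random(354)

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # five folds of 71/71/71/71/70 points
fold_mse = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    fold_mse.append(mean_squared_error(y[test_idx], preds))

print("Per-fold MSE:", fold_mse)
print("Cross-validated MSE estimate:", np.mean(fold_mse))
```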

Furthermore, we evaluated how well our model fit the entire dataset: we trained it on all 354 points and assessed how closely its predictions matched the observed values.

27th September 2023

K-Fold Cross-Validation:

K-fold cross-validation is a commonly used method for estimating the test error of a predictive model. The basic idea is to randomly divide the dataset into K roughly equal-sized segments, or folds. In each iteration, one fold (fold k) is left out, the model is trained on the remaining K-1 folds, and predictions are generated for the omitted fold k. This process is repeated for k = 1, 2, …, K, and the resulting errors are combined. Because each training set is only (K-1)/K the size of the original dataset, the prediction error estimate tends to be biased upward. This bias is smallest when K equals the total number of data points (K = n, leave-one-out cross-validation), although that estimate can have substantial variance.
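In symbols, the K-fold estimate of the test error is just the average of the errors measured on the held-out folds:

CV(K) = (1/K) (MSE1 + MSE2 + … + MSEK)

where MSEk is the mean squared error computed on fold k when the model is trained on the other K-1 folds.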

Distinguishing Between Test and Training Errors:

Test error represents the typical error that arises when using a statistical learning method to predict outcomes for fresh observations that were not part of the model's training. Conversely, training error can be effortlessly computed by applying the same method to the data used during the training phase. It is important to recognize that the training error rate frequently differs considerably from the test error rate, and in particular the former can dramatically underestimate the latter.

25th September 2023

In today's class, the professor explained resampling methods, which come in two main types: cross-validation and the bootstrap. These methods repeatedly refit a model of interest to samples formed from the training set, in order to obtain additional information about the fitted model.

Test error: The test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method.

Training error: The training error can be easily calculated by applying the statistical learning method to the observations used in its training. However, the training error rate is often quite different from the test error rate, and in particular the former can dramatically underestimate the latter.

The validation set approach is a valuable method for estimating test error, but it has limitations: the estimate can vary considerably depending on which observations land in the validation set, and because the model is trained on only part of the data, the approach may overestimate the test error of a model fit on the full dataset. Its results should therefore be interpreted with caution, particularly when deploying the model on the entire dataset.
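For comparison, here is a minimal sketch of the validation set approach itself, again with stand-in data in place of the real predictors and response; because only one random split is made, the resulting error estimate can change noticeably from split to split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Stand-in data; replace with the project's %inactivity, %obesity, %diabetes columns.
rng = np.random.default_rng(1)
X = rng.random((354, 2))
y = rng.random(354)

# Hold out 25% of the observations as a validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=1)
model = LinearRegression().fit(X_train, y_train)
val_mse = mean_squared_error(y_val, model.predict(X_val))
print("Validation-set MSE estimate:", val_mse)
```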

22nd September 2023

The correlation between %diabetes, %inactivity, and %obesity

For my project, the equation for multiple regression can be given as

Y = β0 + β1X1 + β2X2 + ε

Y represents the percentage of individuals with diabetes, X1 denotes the percentage of people who are physically inactive, and X2 represents the percentage of individuals who are obese.

When attempting to determine the relationship between the percentage of individuals with diabetes (%diabetes) and a single predictor, the percentage of inactivity (%inactivity), we find that the R-squared value is approximately 0.1952. In other words, roughly 20% of the variation in %diabetes is explained by %inactivity alone.
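A minimal sketch of that calculation in Python, assuming the project sheet has been loaded into a pandas DataFrame; the column names and the placeholder numbers below are illustrative assumptions, not the actual data.

```python
import pandas as pd
from scipy.stats import pearsonr

# Placeholder frame; in practice this is read from the project sheet.
df = pd.DataFrame({
    "%inactivity": [22.1, 25.3, 30.4, 27.8, 24.6],
    "%diabetes":   [ 8.4,  9.1, 11.2, 10.3,  9.0],
})

r, p_value = pearsonr(df["%inactivity"], df["%diabetes"])
print("Pearson r:", r)
print("R-squared:", r ** 2)  # about 0.195 on the full 354-point dataset
```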

Initially, when constructing a linear model incorporating two variables, X1 (representing inactivity) and X2 (representing obesity), the R-squared value for this model is approximately 34%. However, the situation takes an intriguing turn from here.

If we attempt the same procedure, with the key distinction being that we center the variables before constructing the linear model, the resulting R-squared value for this model is approximately 36%. In this instance, it becomes evident that there has been an increase of approximately 2% in the R-squared value compared to the previous approach.
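Here is a minimal sketch of both fits, again with placeholder data and assumed column names; "centering" here simply means subtracting each predictor's mean before fitting.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Placeholder data; replace with the real 354-row project sheet.
df = pd.DataFrame({
    "%inactivity": [22.1, 25.3, 30.4, 27.8, 24.6, 28.9],
    "%obesity":    [31.0, 33.5, 38.2, 35.1, 32.4, 36.7],
    "%diabetes":   [ 8.4,  9.1, 11.2, 10.3,  9.0, 10.8],
})

X = df[["%inactivity", "%obesity"]].to_numpy()
y = df["%diabetes"].to_numpy()

# Model on the raw predictors.
raw_model = LinearRegression().fit(X, y)
print("R-squared (raw):", raw_model.score(X, y))

# Model on mean-centered predictors.
X_centered = X - X.mean(axis=0)
centered_model = LinearRegression().fit(X_centered, y)
print("R-squared (centered):", centered_model.score(X_centered, y))
```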

20th September 2023

In today’s class, we were introduced to the Crab Molt Model, which serves as a powerful linear modeling technique tailored for scenarios where two variables demonstrate characteristics such as non-normal distribution, skewness, elevated variance, and high kurtosis. The primary objective of this model is to make predictions regarding pre-molt size using information about post-molt size.

We also covered the concept of statistical significance, with a specific focus on differences in means. Using data from Chapter 7 (page 139) of the textbook "Stat Labs: Mathematical Statistics Through Applications," we constructed a model and generated a linear plot. When we plotted the post-molt and pre-molt sizes, we noted a significant difference in their means: the two distributions were strikingly similar in size and shape, with their means differing by only about 14.68 units.
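A minimal sketch of such a fit, assuming NumPy arrays post_molt and pre_molt holding the carapace sizes; the numbers below are illustrative placeholders, not the actual Stat Labs measurements.

```python
import numpy as np
from scipy import stats

# Illustrative placeholder values; the real arrays come from the Stat Labs crab data.
post_molt = np.array([142.1, 147.5, 155.0, 150.3, 160.2, 158.7])
pre_molt  = np.array([127.7, 132.9, 140.1, 135.8, 145.6, 143.9])

# Simple linear regression predicting pre-molt size from post-molt size.
slope, intercept, r_value, p_value, std_err = stats.linregress(post_molt, pre_molt)
print("pre_molt ≈", intercept, "+", slope, "* post_molt")
print("R-squared:", r_value ** 2)

# Difference in means (about 14.68 units on the real data).
print("Mean difference:", post_molt.mean() - pre_molt.mean())
```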

Pre-molt data denotes measurements or observations made prior to a particular event, whereas post-molt data pertains to measurements or observations taken subsequent to that event. These terms are frequently employed to analyze variations or discrepancies in variables occurring before and after a significant transformation or occurrence.

The Crab Molt Model and the use of t-tests to examine differences in means are valuable tools for making sense of complex data. However, when dealing with scenarios involving multiple variables, it becomes necessary to turn to more advanced statistical methods to dig deeper into the data and strengthen our understanding of statistical significance.
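For the difference in means specifically, a paired t-test is a natural choice, since each crab contributes both a pre-molt and a post-molt measurement. A minimal sketch with the same placeholder values as above:

```python
import numpy as np
from scipy import stats

# Same illustrative placeholder values as in the regression sketch above.
post_molt = np.array([142.1, 147.5, 155.0, 150.3, 160.2, 158.7])
pre_molt  = np.array([127.7, 132.9, 140.1, 135.8, 145.6, 143.9])

# Paired t-test: each crab contributes both measurements, so the pairs are dependent.
t_stat, p_value = stats.ttest_rel(post_molt, pre_molt)
print("t statistic:", t_stat, "p-value:", p_value)
```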

18th September 2023

Multiple Linear Regression:

A linear regression model with more than one predictor variable is called multiple linear regression. In multiple linear regression, we have one dependent variable and multiple independent variables; the dependent variable is what we are trying to predict. The main goal of this model is to identify the relationship between the dependent variable and the independent variables.

The Equation for Multiple Linear Regression:

Y = β0 + β1X1 + β2X2 + … + βpXp + ε

In today's class, the professor gave an example of multiple linear regression in which Y is the dependent variable and X1 and X2 are the independent variables: Y stands for diabetes, X1 for inactivity, and X2 for obesity.
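A minimal sketch of that example using statsmodels' formula interface; the column names and the placeholder numbers are assumptions, not the actual project data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder data; in practice this is the 354-row CDC project sheet.
df = pd.DataFrame({
    "diabetes":   [ 8.4,  9.1, 11.2, 10.3,  9.0, 10.8],
    "inactivity": [22.1, 25.3, 30.4, 27.8, 24.6, 28.9],
    "obesity":    [31.0, 33.5, 38.2, 35.1, 32.4, 36.7],
})

# Y = diabetes, X1 = inactivity, X2 = obesity.
model = smf.ols("diabetes ~ inactivity + obesity", data=df).fit()
print(model.summary())
```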

OVERFIT:

The model is a very good (even perfect) fit to the training data but performs poorly on new data.

15th September 2023

The initial stage in data analysis is to make sure the data are accurate and unambiguous. We used Python's NumPy to perform basic statistical operations such as computing medians, means, and standard deviations on the project sheet's diabetes, inactivity, and obesity data. These computations give us a rudimentary understanding of the dataset.

Our primary goal was to show a correlation between the percentage of people with diabetes and the percentage of people who are inactive. To achieve this, we created a scatter plot in which each region is a data point. This visual aid was very helpful in assessing the relationship between these two variables. The R-squared value, a statistic that measures the strength of this relationship, was then computed from the same data.
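A minimal sketch of those steps, assuming the sheet has been loaded into a pandas DataFrame; the column names and values below are placeholders, not the actual data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder data; in practice this is read from the project sheet.
df = pd.DataFrame({
    "%inactivity": [22.1, 25.3, 30.4, 27.8, 24.6, 28.9],
    "%diabetes":   [ 8.4,  9.1, 11.2, 10.3,  9.0, 10.8],
})

# Basic summary statistics with NumPy.
for col in df.columns:
    values = df[col].to_numpy()
    print(col, "mean:", np.mean(values), "median:", np.median(values), "std:", np.std(values))

# Scatter plot of %inactivity against %diabetes, one point per region.
plt.scatter(df["%inactivity"], df["%diabetes"])
plt.xlabel("% inactivity")
plt.ylabel("% diabetes")

# R-squared from the correlation coefficient.
r = np.corrcoef(df["%inactivity"], df["%diabetes"])[0, 1]
print("R-squared:", r ** 2)
plt.show()
```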

In today's class, the professor addressed many student questions about the dataset, which helped me understand the upcoming phases of the analysis. He pointed out that non-linear models could also be applied to this dataset, which might lead to a higher R-squared value. In response to my question about transforming the variables, he gave an example: for datasets with a highly skewed distribution, a log transformation can help make them more normally distributed. Since our dataset is already close to normally distributed, he recommended against transforming the variables.

 

What is the p-value?

I attended the second class of MTH522. During this class, the topics of discussion were the p-value in probability and the fair coin.

What is the p-value?

The p-value quantifies the probability of observing a test statistic as extreme as the one computed from the sample data, assuming that the null hypothesis is true. A small p-value (typically less than 0.05) indicates strong evidence against the null hypothesis, while a large p-value suggests weak evidence.

Fair coin:

When flipped, the coin has an equal probability of landing heads or tails; that is, the probability of heads and the probability of tails are both 0.5.

Null Hypothesis:

For the purpose of statistical hypothesis testing, the null hypothesis is the claim that there is no significant difference or effect; it is the default assumption that the test seeks to challenge.
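A small sketch tying these ideas together: under the null hypothesis that the coin is fair, the p-value for an observed number of heads can be computed with a binomial test. The 60-heads-in-100-flips scenario is an invented example, not one from class.

```python
from scipy.stats import binomtest

# Null hypothesis: the coin is fair, so P(heads) = 0.5.
result = binomtest(k=60, n=100, p=0.5, alternative="two-sided")
print("p-value:", result.pvalue)  # about 0.057: weak evidence against fairness at the 0.05 level
```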

Simple Linear Regression

Today marked the inaugural session of MTH522, a captivating journey into the world of statistics and data analysis. In this initial class, we delved into the fundamentals of simple linear regression, an indispensable statistical technique that forms the bedrock of predictive modeling. The session was enriched with real-world relevance as we explored CDC reports on diabetes.

Throughout the class, we were tasked with deciphering intricate graphs and charts, embarking on an insightful journey to unravel the stories hidden within the data. With each graph we analyzed, we honed our skills in deciphering trends, identifying outliers, and extracting meaningful insights.

In the weeks to come, MTH522 promises to be a fascinating exploration of statistical concepts, data interpretation, and practical applications. With each class, we’ll continue to deepen our understanding of statistical techniques, enabling us to make informed decisions, solve real-world problems, and ultimately, harness the power of data to drive innovation and change.