Project-3 Report
8th December 2023
Today, our primary focus is on implementing anomaly detection for our economic indicator dataset. Anomaly detection is a statistical approach aimed at uncovering irregular patterns that deviate from expected behavior, and these outliers can often yield valuable insights; in essence, the process is akin to finding needles in a haystack. In the context of our economic data, these ‘needles’ may represent unusual spikes or dips in indicators such as unemployment rates or hotel occupancy. Detecting these anomalies matters because they could signify significant economic events, shifts, or even errors in the data collection process. To accomplish this task, we are using the Isolation Forest method, an algorithm well suited to identifying anomalies within complex datasets. The technique is especially effective with large, multidimensional data, which aligns well with our objectives.
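Below is a minimal sketch of how the Isolation Forest could be applied with scikit-learn; the file name and the indicator columns referenced are assumptions for illustration, not the exact names in our dataset.

```python
# Minimal sketch of Isolation Forest anomaly detection on the economic
# indicator data. File name and column names are assumptions for illustration.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("economic_indicators.csv")  # hypothetical file name

# Use a few numeric indicators; adjust to the actual column names.
features = ["unemp_rate", "hotel_occup_rate", "logan_passengers"]
X = df[features].dropna()

# contamination is the assumed fraction of anomalies; tune as needed.
model = IsolationForest(n_estimators=200, contamination=0.05, random_state=42)
labels = model.fit_predict(X)          # -1 = anomaly, 1 = normal

anomalies = X[labels == -1]
print(f"Flagged {len(anomalies)} anomalous month(s) out of {len(X)}")
```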
4th December 2023
The “economic indicator” dataset presents a comprehensive collection of economic factors, organized by year and month, offering a snapshot of various economic dimensions. It includes data points such as the number of passengers and international flights at Logan Airport, hotel occupancy rates and average daily rates, total employment figures, and the unemployment rate. Additionally, it encompasses the labor force participation rate, detailed statistics on housing or building projects (including unit counts, total development costs, square footage, and construction-related employment), and insights into the real estate market through foreclosure petitions and deeds, median housing prices, sales volumes, and permits issued for new housing, with a specific focus on affordable housing. This dataset serves as a valuable resource for analyzing key aspects of the economy, encompassing sectors like air travel, hospitality, employment, and real estate, thereby offering insights into the financial stability and trends within a specific region.
29th November 2023
Today, we’re looking at the economic indicators dataset, which has a fascinating interplay of variables. For example, I’m looking into how passenger volume at Logan Airport may serve as a barometer for hotel occupancy rates, providing insight into the state of tourist and business travel. Another intriguing area of focus is the interaction between the job market and the housing sector: a robust job market frequently generates strong housing demand, whereas a slow employment market can lead to a drop in real estate activity. Furthermore, the impact of significant development projects on local economies is noteworthy, demonstrating how such initiatives can drive job creation and revitalize the housing market. This analysis will untangle these economic strands, illustrating how changes in one area can affect others.
27th November 2023
Today, we delve into the housing market’s dynamics, with a particular focus on the evolution of median housing prices and how they reflect broader economic patterns. Our analysis, akin to a roadmap, reveals the market’s fluctuating highs and lows. Rising prices often indicate a robust economy and high housing demand, signaling buyer confidence, while price dips or plateaus might suggest a cooling market due to economic shifts or changing buyer sentiments. These trends are intertwined with broader economic indicators such as employment and interest rates; for instance, a strong job market can increase home buying capacity, pushing prices up, whereas fluctuating interest rates can influence buyer enthusiasm. We also noted potential seasonal trends in the market, suggesting times of year with more activity that subtly impact prices. Understanding these nuances is crucial, offering insights not just into the real estate sector but the broader economy, providing valuable information for buyers, sellers, investors, and policymakers in an ever-evolving landscape.
20th November 2023
The SARIMA (Seasonal Autoregressive Integrated Moving Average) model, an extension of the ARIMA model, is a foundational tool in time series analysis, particularly adept at handling data with seasonal patterns. It comprises Seasonal (S), Autoregressive (AR), Integrated (I), and Moving Average (MA) components: the seasonal component captures recurring patterns, the autoregressive component models the relationship between an observation and its lagged values, the integrated component applies differencing to achieve stationarity, and the moving average component accounts for short-term fluctuations driven by past forecast errors. The synergy of these components, with their respective orders denoted by the parameters p, d, q, P, D, Q, and m, makes SARIMA versatile for predicting future points in a time series, especially in scenarios with seasonal variation. Selecting appropriate orders is crucial when fitting SARIMA models and is often guided by autocorrelation and partial autocorrelation plots together with an understanding of the data’s seasonal characteristics. Overall, SARIMA is a sophisticated and effective tool for time series forecasting and pattern analysis.
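As a concrete sketch, here is how a SARIMA model could be fit with statsmodels’ SARIMAX; the series name, the (1, 1, 1)(1, 1, 1, 12) orders, and the forecast horizon are illustrative assumptions rather than our chosen configuration.

```python
# Minimal SARIMA sketch with statsmodels; the series name, orders
# (p, d, q)(P, D, Q, m), and forecast horizon are illustrative assumptions.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"], index_col="date")
series = df["hotel_avg_daily_rate"]    # hypothetical monthly series

# (1, 1, 1) non-seasonal and (1, 1, 1, 12) seasonal orders for monthly data;
# in practice these are chosen from ACF/PACF plots or a grid search on AIC.
model = SARIMAX(series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
result = model.fit(disp=False)

print(result.summary())
forecast = result.forecast(steps=12)   # forecast the next 12 months
print(forecast)
```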
17th November 2023
Time series analysis, a fundamental aspect of data science, revolves around the examination of sequentially recorded data points, providing valuable insights across diverse domains such as economics and meteorology. This method, integral for predicting future trends based on historical data, is pivotal in uncovering meaningful statistics, identifying patterns, and facilitating forecasts. The core concepts encompass trend analysis, aimed at recognizing long-term movements, seasonality for pattern identification, noise separation to isolate random variability, and stationarity, assuming consistent statistical properties over time. Employing techniques like descriptive analysis for visual inspection, moving averages to smooth short-term fluctuations and emphasize longer-term trends, and ARIMA models for forecasting, time series analysis plays a crucial role in predicting market trends, optimizing weather forecasts, and enabling strategic business planning. With the evolution of the field, machine learning approaches such as Random Forests and Neural Networks are increasingly integrated, offering robust solutions for intricate time series forecasting challenges.
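For instance, a simple moving average can be computed with pandas to smooth short-term fluctuations and expose the longer-term trend; the file, column name, and 12-month window below are assumptions for a monthly series.

```python
# Quick sketch of moving-average smoothing to expose a longer-term trend;
# the column name and 12-month window are assumptions for monthly data.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"], index_col="date")
series = df["logan_passengers"]        # hypothetical monthly series

smoothed = series.rolling(window=12, center=True).mean()  # 12-month moving average

plt.plot(series, label="observed", alpha=0.5)
plt.plot(smoothed, label="12-month moving average")
plt.legend()
plt.show()
```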
15th November 2023
In today’s time series analysis class, I learned about key elements and methods crucial for interpreting temporal data. We explored trend analysis, focusing on recognizing whether data exhibits a rising, falling, or constant trend over time. Another important aspect covered was seasonality, which involves understanding and adjusting for repetitive patterns at regular intervals, such as weekly, monthly, or yearly occurrences. Additionally, the concept of stationarity was emphasized, highlighting that a time series is considered stationary when statistical properties remain consistent over time, a prerequisite for many models. Finally, we discussed popular models like ARIMA, SARIMA, and advanced machine learning models like LSTM networks, providing valuable tools for forecasting, pattern recognition, and analyzing the impact of different factors on time-related data.
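A common way to check stationarity in practice is the Augmented Dickey-Fuller test; the sketch below assumes a hypothetical monthly series from our data.

```python
# Sketch of a stationarity check with the Augmented Dickey-Fuller test;
# the series used here is an assumption for illustration.
import pandas as pd
from statsmodels.tsa.stattools import adfuller

df = pd.read_csv("economic_indicators.csv", parse_dates=["date"], index_col="date")
series = df["unemp_rate"].dropna()     # hypothetical series

adf_stat, p_value, *_ = adfuller(series)
print(f"ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}")
# A small p-value (e.g., < 0.05) suggests the series is stationary;
# otherwise, differencing (the "I" in ARIMA/SARIMA) is usually applied first.
```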
13th November 2023
In today’s class, we discussed time series analysis. Time series analysis can be used to forecast future values of a parameter by establishing trends from existing data. In today’s session, we saw how time series analysis could have been applied to the police shooting data and how shooting trends could have been studied. A second application was demonstrated using data on economic indicators: time series analysis revealed comparable month-to-month trends in hotel prices across a seven-year period (2013-2019). We also discussed the new dataset, which I examined for economic indicators. The Boston Redevelopment Authority provided this dataset, which includes several economic variables.
10th November 2023
Logistic Regression
Logistic regression is a statistical technique commonly used for binary classification problems, where the outcomes are dichotomous, such as yes/no or true/false. Unlike linear regression, which predicts continuous outcomes, logistic regression predicts the likelihood of a given input falling into a certain class. This is done using the logistic (or sigmoid) function, which maps the output of a linear equation to a probability value between 0 and 1. Forecasting the likelihood of a patient having a specific ailment in medicine, predicting customer churn in marketing, and determining credit scores in finance are all common applications. Although logistic regression is relatively simple to implement and interpret and is effective for linearly separable data, it assumes a linear relationship between the predictors and the log-odds of the outcome.
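A minimal example with scikit-learn, using synthetic data rather than any project dataset, shows the probability output that the sigmoid produces.

```python
# Minimal logistic-regression sketch with scikit-learn on synthetic data,
# just to illustrate the probability output of the sigmoid; not project data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

print("Accuracy:", clf.score(X_test, y_test))
print("P(class = 1) for first test row:", clf.predict_proba(X_test[:1])[0, 1])
```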
Report writing
We’ve initiated the project report and as of today, we’ve concluded the sections covering issues, discussions, and results.
3rd November 2023
Asian Descent Victims (Category A)
Typically, individuals in this category are around 36 years of age. However, there’s a significant age range amongst them. Based on our data, we’re 95% certain that the average age is somewhere between 34 and 38, indicating a broad age distribution.
African Descent Victims (Category B)
The average age for this group is 33. Like the Asian category, there’s noticeable age variability among individuals. With a confidence level of 95%, we estimate the average age to be between 32 and 33, a much tighter interval around the mean.
Victims of Hispanic Background (Category H)
On average, victims in this group are 34 years old. Their age spread seems to be quite similar to the African descent category, with a 95% confidence interval between 33 and 34 years.
Victims from Native American Background (Category N)
For this group, the median age hovers around 33, but with a 95% confidence range of 31 to 34 years, highlighting a wider age variation compared to some other categories.
Victims from Various Ethnic Backgrounds (Category O)
This diverse category has an average age of roughly 34 years. The range in age is quite extensive, with a 95% confidence interval from 28 to 39, showcasing the vast age differences within this category.
Victims of European Descent (Category W)
Victims in this group are the oldest on average, at close to 40 years. The 95% confidence interval for the average age is narrow, spanning only 40 to 41.
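For reference, the 95% confidence intervals above could be computed along these lines; the file name and the ‘age’ and ‘race’ columns are assumptions based on the Washington Post dataset.

```python
# Hedged sketch of how the 95% confidence intervals above could be computed;
# the data file and column names ('age', 'race') are assumptions based on the
# Washington Post fatal police shootings dataset.
import pandas as pd
from scipy import stats

data = pd.read_csv("fatal-police-shootings-data.csv")

for race, group in data.dropna(subset=["age", "race"]).groupby("race"):
    ages = group["age"].to_numpy()
    mean = ages.mean()
    sem = stats.sem(ages)              # standard error of the mean
    lo, hi = stats.t.interval(0.95, len(ages) - 1, loc=mean, scale=sem)
    print(f"{race}: mean age {mean:.1f}, 95% CI ({lo:.1f}, {hi:.1f}), n={len(ages)}")
```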
1st November 2023
During our analysis of fatal confrontations with law enforcement, we delved into the interplay between factors like age, racial background, and threat levels with signs of psychological concerns.
Our investigation pinpointed a distinct association between age and markers of mental health. The t-test highlighted a significant age difference between individuals exhibiting and not exhibiting mental health symptoms, producing a t-statistic of 8.51 and a p-value close to zero. This underscores the strong link between age and the presence of mental health indicators in these incidents.
Regarding racial background, we navigated through initial data anomalies and subsequently executed a chi-square analysis. This analysis showcased a considerable connection between racial background and markers of mental health, yielding a chi-square value of 171.23 and a p-value around 3.98×10^-35.
In a similar vein, there was a marked relationship between the level of threat assessed and indicators of mental health, as shown by a chi-square value of 24.48 and a p-value approaching 4.82×10^-6.
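A sketch of how these tests could be run with SciPy follows; the column names (‘age’, ‘signs_of_mental_illness’, ‘race’, ‘threat_level’) are assumptions based on the Washington Post dataset, and the exact preprocessing we used is not reproduced here.

```python
# Sketch of the tests described above with SciPy; the column names
# ('age', 'signs_of_mental_illness', 'race', 'threat_level') are assumptions
# based on the Washington Post dataset.
import pandas as pd
from scipy import stats

df = pd.read_csv("fatal-police-shootings-data.csv")
signs = df["signs_of_mental_illness"]

# t-test: age for victims with vs. without mental-health indicators
with_signs = df.loc[signs == True, "age"].dropna()
without_signs = df.loc[signs == False, "age"].dropna()
t_stat, p_val = stats.ttest_ind(with_signs, without_signs)
print(f"t = {t_stat:.2f}, p = {p_val:.2e}")

# chi-square: race vs. mental-health indicator
chi2, p, dof, _ = stats.chi2_contingency(pd.crosstab(df["race"], signs))
print(f"race vs. mental health: chi2 = {chi2:.2f}, p = {p:.2e}")

# chi-square: threat level vs. mental-health indicator
chi2, p, dof, _ = stats.chi2_contingency(pd.crosstab(df["threat_level"], signs))
print(f"threat level vs. mental health: chi2 = {chi2:.2f}, p = {p:.2e}")
```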
23rd October 2023
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups data points based on their proximity and density, which differs from approaches such as k-means that require the user to predetermine the cluster count. DBSCAN instead examines the data to identify areas of high density and separates them from sparser regions. It does this by tracing a neighborhood around each data point and grouping points into the same cluster when a significant number of them sit tightly together, indicating high density. Data points in low-density regions that do not fit into any cluster are treated as noise. This property makes DBSCAN particularly useful for finding clusters of various shapes and sizes, as well as for handling datasets with inherent noise.
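A minimal DBSCAN sketch on synthetic two-dimensional data; the eps and min_samples values are illustrative and would need tuning on real data.

```python
# Minimal DBSCAN sketch on synthetic 2-D data; eps and min_samples are
# illustrative values that would need tuning on real data.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.06, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                    # -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters found: {n_clusters}, noise points: {list(labels).count(-1)}")
```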
K-means is a clustering technique that divides a dataset into a set number of clusters or groups. The technique starts by randomly selecting ‘k’ initial points known as ‘centroids.’ Each data point is then assigned to the nearest centroid, and new centroids are computed as the average of all points inside each cluster. This process of assigning data points to the nearest centroid and updating the centroids is repeated until the centroids change only slightly. The result is ‘k’ clusters in which data points within the same cluster are closer to one another than to points in other clusters. The user must specify the ‘k’ value in advance, representing the desired number of clusters.
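And a matching k-means sketch, again on synthetic data, where k = 3 is an assumed choice.

```python
# Minimal k-means sketch on synthetic data; k=3 is an assumption and must be
# chosen by the user (e.g., with the elbow method) on real data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster sizes:", [list(km.labels_).count(i) for i in range(3)])
print("Centroids:\n", km.cluster_centers_)
```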
21st October 2023
I am doing an analysis of fatal police shootings for the project, and I ran an experiment to look at the age distribution of people killed by police. I constructed a histogram and carefully picked appropriate bin sizes to portray the age groups and their corresponding frequencies. Several significant conclusions emerged from this investigation. The majority of incidents involved people between the ages of 15 and 42, with the largest number occurring between the ages of 24 and 29. In comparison to incidents involving young adults, incidents involving children aged 2 to 15 were quite infrequent. Furthermore, the data showed a significant decrease in incidents as age climbed above 42, a clear downward trend.
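The histogram described above could be produced roughly as follows; the bin edges are illustrative and the ‘age’ column name is an assumption.

```python
# Sketch of the age histogram described above; the bin edges are illustrative
# and the column name 'age' is assumed from the Washington Post dataset.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("fatal-police-shootings-data.csv")
ages = df["age"].dropna()

bins = range(0, 101, 5)                # 5-year bins from 0 to 100
plt.hist(ages, bins=bins, edgecolor="black")
plt.xlabel("Age")
plt.ylabel("Number of incidents")
plt.title("Age distribution of fatal police shooting victims")
plt.show()
```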
18th October 2023
Our investigation today aimed to identify any possible differences in the number of police shootings involving Black and White people. First, we extracted the most important statistical parameters for both datasets: minimum, maximum, mean, median, standard deviation, skewness, and kurtosis. These measurements gave us the fundamental knowledge that served as the basis for our later histogram-based graphics. Notably, we found that the age profiles of both Black and White victims of police shootings deviated from the normal distribution when age was taken into account. The non-normality of the data raised doubts about the appropriateness of using the t-test to determine the p-value; aware of this constraint, we chose to estimate the p-value using the Monte Carlo approach. We used Cohen’s d method to calculate the magnitude of this difference, and the result was a value of 0.577, which indicates a medium effect size and highlights a notable and substantial difference between these two demographic groups.
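Below is a hedged sketch of the Monte Carlo (permutation) p-value and the Cohen’s d computation; the race codes ‘B’ and ‘W’ and the column names are assumptions based on the Washington Post dataset.

```python
# Hedged sketch of the Monte Carlo (permutation) p-value and Cohen's d for the
# difference in mean age between Black ('B') and White ('W') victims; column
# names and race codes are assumptions based on the Washington Post dataset.
import numpy as np
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")
black = df.loc[df["race"] == "B", "age"].dropna().to_numpy()
white = df.loc[df["race"] == "W", "age"].dropna().to_numpy()

observed = abs(white.mean() - black.mean())

# Monte Carlo estimate: shuffle the pooled ages many times and see how often a
# difference at least as large as the observed one arises by chance.
rng = np.random.default_rng(0)
pooled = np.concatenate([black, white])
count = 0
n_iter = 10_000
for _ in range(n_iter):
    rng.shuffle(pooled)
    diff = abs(pooled[: len(black)].mean() - pooled[len(black):].mean())
    count += diff >= observed
print("Monte Carlo p-value:", count / n_iter)

# Cohen's d with a pooled standard deviation
nb, nw = len(black), len(white)
pooled_sd = np.sqrt(((nb - 1) * black.var(ddof=1) + (nw - 1) * white.var(ddof=1)) / (nb + nw - 2))
print("Cohen's d:", observed / pooled_sd)
```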
16th October 2023
After closely analyzing the map, I discovered the following points:
The map highlights areas with a higher frequency of police shootings, particularly in urban regions. This heightened concentration could be attributed to factors such as increased population densities or elevated crime rates in these urban areas. In contrast, rural areas and certain states exhibit fewer instances of police shootings, possibly linked to their lower population densities or a reduced need for police intervention.
Interestingly, certain states appear to have a disproportionately high number of shootings relative to their size and population. A more in-depth analysis comparing the number of shootings to each state’s population could offer valuable insights into which states experience a higher or lower number of such incidents.
Moreover, major cities show a higher incidence of shooting events. This correlation may be influenced by a combination of factors, including the greater population density in urban centers, a heightened police presence, and socio-economic factors specific to these urban environments.
13th October 2023
I performed an in-depth exploratory data analysis on the Washington Post shootings dataset today, which has 8,770 records and 19 columns.
The second project involves evaluating data from the Washington Post data repository, with a particular emphasis on fatal police shootings in the United States. I started with simple operations like “describe()” and “info()” to get a sense of the data and its features. The data have some missing values, and there are 8,770 shootings in total, spanning 2015 to 2023 (the current date). The data cover 51 states, 3,374 cities, and 3,417 police departments. That is as far as my analysis went today.
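The first-pass commands looked roughly like this; the file name follows the repository’s usual naming but should be treated as an assumption.

```python
# First-pass EDA sketch; the file name matches the Washington Post repository's
# usual naming, but treat it as an assumption.
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")

print(df.shape)                 # rows and columns
df.info()                       # column types and non-null counts
print(df.describe())            # summary statistics for numeric columns
print(df["state"].nunique(), "states,", df["city"].nunique(), "cities")
```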
11th October 2023
In today’s class, we started analyzing the data for project-2 and discussed some queries regarding the data. The Washington Post maintains a database of fatal police shootings in the United States. This data set contains records dating back to January 2, 2015, and the information is updated once a week. I performed a rudimentary examination and discovered that there are numerous missing values.
As I initiated a preliminary analysis, a noticeable challenge has surfaced: the presence of numerous missing values, particularly in critical parameters like ‘flee,’ ‘age,’ and ‘race.’ This gap in data poses an important question – how do we effectively address these gaps and proceed with the project?
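One way to frame the options, sketched below, is to quantify the gaps first and then either drop incomplete rows or mark categorical gaps explicitly; these are candidate strategies, not the approach we have settled on.

```python
# Sketch of how the missing-value question above could be approached; the
# strategies shown (dropping vs. filling) are options, not the chosen method.
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")

# How large is the gap in each critical column?
print(df[["flee", "age", "race"]].isna().sum())

# Option 1: drop rows missing a critical field (simple, but loses data)
complete = df.dropna(subset=["age", "race"])

# Option 2: keep the rows and mark categorical gaps explicitly
filled = df.copy()
filled["race"] = filled["race"].fillna("Unknown")
filled["flee"] = filled["flee"].fillna("Unknown")
print(len(df), len(complete), len(filled))
```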
In the coming days, I will do a thorough study to determine any relationships between the variables in order to obtain a more complete picture.
Report For Project1
code for the project: Project code.
6th October 2023
Today I completed the entire coding part of my project and also started working on the report. In that project, we got an R-squared value of about 0.3407, meaning the model explains only about 34% of the variance; if we want to predict more than that, we have to consider additional aspects to get a better model. We also drew the graph showing the relationship of diabetes with inactivity and obesity; in that graph, the obesity and inactivity points fall in roughly the same location.
4th October 2023
During today’s session, we primarily centered our attention on our project. We allocated a substantial amount of time to tackling queries and issues associated with the project. This conversation enabled us to resolve any doubts and ensure that everyone had a clear understanding of the project’s Issues, Findings, and Methods A, B, and C.
We initiated the actual implementation phase by beginning to write the project report. We completed the Issues and Findings sections today: in Issues, we describe the basis on which we are analyzing the data and where the data comes from, and in Findings, we report the values we obtained while analyzing whether the data is relevant for predicting diabetes.
2nd October 2023
In today’s class, we discussed the format of the report and what concepts we needed to include in the report.
Regarding project
We have done the visualization part. That visualization shows the relationship between the independent variables and the dependent variable for the test data. The independent variables are obesity and inactivity, and the dependent variable is diabetes. We have drawn plots of obesity versus diabetes and inactivity versus diabetes. I also obtained the R^2 value and the mean squared error value. The R^2 value is 0.395; this represents the proportion of the variance in the dependent variable that is explained by the independent variables in the model. R^2 ranges from 0 to 1, with higher values indicating a better fit. The reported MSE value is -0.400063. The MSE represents the average of the squared errors between the predicted and actual values; lower values are better, but the scale depends on the dependent variable.
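For context, the sketch below shows how R^2 and MSE are typically computed with scikit-learn; the file and column names are assumptions. One hedge worth noting: scikit-learn’s cross-validation scorer ‘neg_mean_squared_error’ reports the MSE with a flipped sign, which is a likely reason a negative MSE value appears above.

```python
# Sketch of computing R^2 and MSE for the diabetes model; the CSV name and
# column names are assumptions, not the project's actual file.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("cdc_diabetes.csv")            # hypothetical file
X = df[["%inactivity", "%obesity"]]             # hypothetical column names
y = df["%diabetes"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

print("R^2:", r2_score(y_test, pred))
print("MSE:", mean_squared_error(y_test, pred))
```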
29th September 2023
We conducted a study utilizing a dataset specifically designed to investigate three critical factors impacting an individual’s overall health: obesity, physical inactivity, and diabetes. Our analysis involved gathering data for 354 data points, with detailed measurements for each of these variables.
To assess our data, we divided the 354 data points into five roughly equal portions (the choice of five is not fixed and can be adjusted as needed). Four of these portions consisted of 71 data points and one of 70. Four segments were used to train our model, while the remaining one was reserved for testing its performance. We repeated this process five times, each time holding out a different segment for testing.
Furthermore, we evaluated how effectively our model aligned with the entire dataset. To accomplish this, we trained the model on the entire dataset and assessed its performance based on its ability to predict actual results.
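A sketch of the five-fold procedure with scikit-learn, assuming hypothetical file and column names:

```python
# Sketch of the 5-fold procedure described above with scikit-learn; file and
# column names are assumptions, and n_splits=5 mirrors the split into five parts.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

df = pd.read_csv("cdc_diabetes.csv")
X = df[["%inactivity", "%obesity"]]
y = df["%diabetes"]

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring="r2")
print("Per-fold R^2:", scores)
print("Mean R^2:", scores.mean())
```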
27th September 2023
K-Fold Cross-Validation:
K-fold cross-validation is a commonly employed method for assessing the test error of a predictive model. The basic concept involves randomly dividing the dataset into K equal-sized segments or folds. In each iteration, one of these segments (the kth part) is left out, while the model is trained using the remaining K-1 segments; predictions are then generated for the omitted kth segment. This process is repeated for each segment, with k running from 1 to K, and the outcomes are combined. Because each training set is only (K-1)/K the size of the original training set, the resulting prediction-error estimate tends to be biased upward; this bias is smallest when K equals the total number of data points (K = n, leave-one-out), although the variance of the estimate can then be substantial.
Distinguishing Between Test and Training Errors:
Test error represents the typical error that arises when a statistical learning technique is used to forecast outcomes for fresh observations that were not part of the model’s training. Conversely, training error can be effortlessly computed by applying the same technique to the data that was used during the model’s training phase. It is important to recognize that the training error rate frequently diverges considerably from the test error rate, with the former often substantially underestimating the latter.
25th September 2023
In today’s class, the professor explained resampling methods, of which there are two main types: cross-validation and the bootstrap. These methods refit a model of interest to samples formed from the training set in order to obtain additional information about the fitted model.
Test error: the test error is the average error that results from using a statistical learning method to predict the response on a new observation, one that was not used in training the method.
Training error: the training error can be easily calculated by applying the statistical learning method to the observations used in its training. However, the training error rate is often quite different from the test error rate, and in particular the former can dramatically underestimate the latter.
The Validation Set Approach is a valuable method for estimating test error, but it comes with certain limitations stemming from variability and the risk of potential model underfitting. Caution should be exercised when interpreting its findings, particularly when deploying the model on the entire dataset.
22nd September 2023
The correlation between %diabetes, %inactivity, and %obesity
For my project, the equation for multiple regression can be given as
Y = β0 + β1X1 + β2X2 + ε
Y represents the percentage of individuals with diabetes, X1 denotes the percentage of people who are inactive, and X2 represents the percentage of individuals who are obese.
When attempting to determine the correlation between the percentage of individuals with diabetes (%diabetes) and a single variable, specifically the percentage of inactivity (%inactivity), we find that the R-squared (the square of Pearson’s correlation coefficient) is approximately 0.1952. In other words, %inactivity alone explains roughly 20% of the variance in %diabetes.
At the outset, when constructing a linear model incorporating two variables, namely x1 (representing inactivity) and x2 (representing obesity), the R-squared value for this model is approximately 34%. However, the situation takes an intriguing turn from here.
If we attempt the same procedure, with the key distinction being that we center the variables before constructing the linear model, the resulting R-squared value for this model is approximately 36%. In this instance, it becomes evident that there has been an increase of approximately 2% in the R-squared value compared to the previous approach.
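As a sketch (with assumed file and column names), the raw and centered fits could be compared with statsmodels as below; note that with an intercept and no interaction terms, centering by itself normally leaves R-squared unchanged, so this is only a template for the comparison.

```python
# Hedged sketch of the two-predictor model above with statsmodels; file and
# column names are assumptions. Note: with an intercept and no interaction
# terms, centering the predictors typically does not change R^2.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("cdc_diabetes.csv")
y = df["%diabetes"]
X = df[["%inactivity", "%obesity"]]

# Raw predictors
raw_model = sm.OLS(y, sm.add_constant(X)).fit()
print("R^2 (raw):", raw_model.rsquared)

# Centered predictors (subtract each column's mean)
X_centered = X - X.mean()
centered_model = sm.OLS(y, sm.add_constant(X_centered)).fit()
print("R^2 (centered):", centered_model.rsquared)
```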
20th September 2023
In today’s class, we were introduced to the Crab Molt Model, which serves as a powerful linear modeling technique tailored for scenarios where two variables demonstrate characteristics such as non-normal distribution, skewness, elevated variance, and high kurtosis. The primary objective of this model is to make predictions regarding pre-molt size using information about post-molt size.
We also covered the concept of statistical significance, with a specific focus on disparities in means. Utilizing data from the textbook “Stat Labs: Mathematical Statistics Through Applications,” Chapter 7, page 139, we constructed a model and generated a linear plot. When we plotted the post-molt and pre-molt sizes, we noted a significant difference in their means; interestingly, the shapes of the two distributions were strikingly similar, with the means differing by only 14.68 units.
Pre-molt data denotes measurements or observations made prior to a particular event, whereas post-molt data pertains to measurements or observations taken subsequent to that event. These terms are frequently employed to analyze variations or discrepancies in variables occurring before and after a significant transformation or occurrence.
The Crab Molt Model and the utilization of t-tests to examine differences in means serve as valuable tools for unraveling intricate data intricacies. However, when dealing with complex scenarios involving multiple variables, it becomes imperative to embrace advanced statistical methods to delve deeper into the data and enhance our comprehension of statistical significance.
18th September 2023
Multiple Linear Regression:
A linear regression model with more than one predictor variable is called multiple linear regression. In multiple linear regression, we have one dependent variable and multiple independent variables. The dependent variable is what we are trying to predict. The main goal of this model is to identify the relationship between the dependent variable and the independent variables.
The Equation for Multiple Linear Regression:
Y=β0+β1X1+β2X2+…+βpXp+ε
In today’s class, the professor gave an example of multiple linear regression in which Y is the dependent variable and X1 and X2 are the independent variables: Y stands for diabetes, X1 for inactivity, and X2 for obesity.
OVERFIT:
The model is a very good (even perfect) fit to the training data but behaves poorly on new data.
15th September 2023
The initial stage in data analysis is to ensure the data is precise and unambiguous. We used Python’s NumPy to perform basic statistical operations such as computing medians, means, and standard deviations on the project sheet covering diabetes, inactivity, and obesity. These computations give us a rudimentary understanding of the dataset.
Our primary goal was to show the correlation between the percentage of people with diabetes and the percentage who are physically inactive. To achieve this, we created a scatter plot in which each region is a data point. This visual aid was very helpful in assessing the relationship between the two variables. The R-squared value, a statistic that measures the strength of this relationship, was then computed from the same data.
In today’s class, the professor addressed many student queries about the dataset, which helped me understand the upcoming phases of the analysis. He pointed out that non-linear models could be applied to this dataset, which might lead to a higher R-squared value. In response to my question about transforming the variables, he gave an example: for datasets with highly skewed distributions, a log transformation can help make them more normally distributed. Since our dataset is already close to normally distributed, he recommended against transforming the variables.
What is the p-value?
I attended the second class of MTH522. During this class, the topics of discussion were the p-value in probability and the fair coin.
What is the p-value?
The p-value quantifies the probability of observing a test statistic as extreme as the one computed from the sample data, assuming that the null hypothesis is true. A small p-value (typically less than 0.05) indicates strong evidence against the null hypothesis, while a large p-value suggests weak evidence.
Fair coin:
When flipped, the coin has an equal probability of landing on heads or tails. Accordingly, the probability of getting heads and the probability of getting tails are both 0.5.
Null Hypothesis:
For the purpose of statistical hypothesis testing, the null hypothesis is a claim that assumes there is no significant difference.
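A small worked example ties the three ideas together: using SciPy’s binomial test to compute the p-value for a hypothetical 60 heads in 100 flips under the null hypothesis that the coin is fair (the counts are made up for illustration).

```python
# Worked example: is a coin fair (null hypothesis p = 0.5) if we observe
# 60 heads in 100 flips? The counts are made up purely for illustration.
from scipy.stats import binomtest

result = binomtest(k=60, n=100, p=0.5)   # two-sided by default
print("p-value:", result.pvalue)

# If the p-value were below 0.05, we would reject the null hypothesis that the
# coin is fair; here it comes out to roughly 0.06, so the evidence falls just
# short of that threshold.
```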
Simple Linear Regression
Today marked the inaugural session of MTH522, a captivating journey into the world of statistics and data analysis. In this initial class, we delved into the fundamentals of simple linear regression, an indispensable statistical technique that forms the bedrock of predictive modeling. The session was enriched with real-world relevance as we explored CDC reports on diabetes.
Throughout the class, we were tasked with deciphering intricate graphs and charts, embarking on an insightful journey to unravel the stories hidden within the data. With each graph we analyzed, we honed our skills in deciphering trends, identifying outliers, and extracting meaningful insights.
In the weeks to come, MTH522 promises to be a fascinating exploration of statistical concepts, data interpretation, and practical applications. With each class, we’ll continue to deepen our understanding of statistical techniques, enabling us to make informed decisions, solve real-world problems, and ultimately, harness the power of data to drive innovation and change.