23rd October 2023

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that groups data points based on their proximity and density. It differs from approaches such as k-means, which require the user to predetermine the cluster count; DBSCAN instead examines the data to identify areas of high density and separates them from sparser regions. It does this by tracing a neighborhood around each data point and grouping points into the same cluster when enough of them lie tightly together, indicating high density. Data points in low-density regions that do not fit into any cluster are treated as noise. This property makes DBSCAN particularly useful for finding clusters of various shapes and sizes, as well as for handling datasets with inherent noise.
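To make this concrete, here is a minimal sketch of DBSCAN using scikit-learn on synthetic two-moons data; the eps and min_samples values are illustrative choices, not tuned parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic data with two non-spherical clusters, which k-means handles poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps: neighborhood radius; min_samples: points needed for a dense region
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Points labeled -1 are treated as noise
print("clusters found:", len(set(db.labels_)) - (1 if -1 in db.labels_ else 0))
print("noise points:", int(np.sum(db.labels_ == -1)))
```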

K-means is a clustering technique that divides a dataset into a set number of clusters or groups. The algorithm starts by randomly selecting ‘k’ initial points known as ‘centroids.’ Each data point is then assigned to the nearest centroid, and new centroids are computed as the average of all points inside each cluster. This process of assigning data points to the nearest centroid and updating the centroids is repeated until the centroids change only slightly. The result is ‘k’ clusters in which data points within the same cluster are closer to one another than to points in other clusters. The user must specify the ‘k’ value in advance, representing the desired number of clusters.
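A minimal k-means sketch along the same lines, again with scikit-learn; the choice of k = 3 and the blob data are purely illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# n_clusters is the 'k' the user must specify in advance
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print("centroids:\n", km.cluster_centers_)
print("first ten labels:", km.labels_[:10])
```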

21st October 2023

I am doing an analysis of fatal police shootings for the project, and I ran an experiment to look at the age distribution of people killed by police. I constructed a histogram and carefully picked appropriate bin sizes to portray the age groups and their corresponding frequencies. Several significant conclusions emerged from this investigation. The majority of incidents involved people between the ages of 15 and 42, with the largest number of incidents occurring between the ages of 24 and 29. In comparison to incidents involving young adults, incidents involving children aged 2 to 15 years were quite infrequent. Furthermore, the data showed a significant decrease in incidents as age climbed above 42 years, showing a downward trend.
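For reference, a rough sketch of how such a histogram can be produced; the filename is an assumed local copy of the data, and the 3-year bin width is an illustrative choice:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # assumed local copy
ages = df["age"].dropna()

# 3-year bins are one illustrative choice for showing the age groups clearly
plt.hist(ages, bins=range(0, 100, 3), edgecolor="black")
plt.xlabel("Age")
plt.ylabel("Number of fatal shootings")
plt.title("Age distribution of people killed by police")
plt.show()
```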

18th October 2023

Our investigation today aimed to identify any possible differences in the number of police shootings involving Black and White people. First, we extracted the most important statistical parameters for both groups: minimum, maximum, mean, median, standard deviation, skewness, and kurtosis. These measurements gave us the fundamental knowledge that served as the basis for our later histogram-based visualizations. Notably, we found that the age profiles of both Black and White victims of police shootings deviated from the normal distribution. This non-normality raised doubts about the appropriateness of using the t-test to determine the p-value; aware of this constraint, we chose to estimate the p-value using the Monte Carlo approach. We then used Cohen’s d to quantify the magnitude of the difference, obtaining a value of 0.577, which indicates a medium effect size and a substantial difference between the two demographic groups.
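A sketch of the two calculations, assuming a local copy of the Washington Post data with ‘race’ codes ‘B’ and ‘W’ and an ‘age’ column; the permutation count and variable names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # assumed local copy
ages_b = df.loc[df["race"] == "B", "age"].dropna().to_numpy()
ages_w = df.loc[df["race"] == "W", "age"].dropna().to_numpy()

def cohens_d(a, b):
    # Cohen's d using the pooled standard deviation
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                        / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled_sd

def monte_carlo_p(a, b, n_iter=10_000, seed=0):
    # Permutation test: shuffle the pooled ages and count how often the
    # shuffled mean difference is at least as large as the observed one.
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        if abs(pooled[:len(a)].mean() - pooled[len(a):].mean()) >= observed:
            hits += 1
    return hits / n_iter

print("Cohen's d:", cohens_d(ages_w, ages_b))
print("Monte Carlo p-value:", monte_carlo_p(ages_w, ages_b))
```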

16th October 2023

After closely analyzing the map, I discovered the following points:

The map highlights areas with a higher frequency of police shootings, particularly in urban regions. This heightened concentration could be attributed to factors such as increased population densities or elevated crime rates in these urban areas. In contrast, rural areas and certain states exhibit fewer instances of police shootings, possibly linked to their lower population densities or a reduced need for police intervention.

Interestingly, certain states appear to have a disproportionately high number of shootings relative to their size and population. A more in-depth analysis comparing the number of shootings to each state’s population could offer valuable insights into which states experience a higher or lower number of such incidents; a rough sketch of this comparison appears after these points.

Moreover, major cities show a higher incidence of shooting events. This correlation may be influenced by a combination of factors, including the greater population density in urban centers, a heightened police presence, and socio-economic factors specific to these urban environments.
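Following up on the second point, a rough sketch of the per-capita comparison; the population file and its column names are hypothetical stand-ins for, e.g., Census data:

```python
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # assumed local copy
# hypothetical file mapping state abbreviations to populations
state_pop = pd.read_csv("state_populations.csv", index_col="state")["population"]

shootings_by_state = df["state"].value_counts()
per_100k = (shootings_by_state / state_pop * 100_000).sort_values(ascending=False)
print(per_100k.head(10))  # states with the most shootings per capita
```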

13th October 2023

I performed an in-depth exploratory data analysis (EDA) today on the Washington Post shooting dataset, which has 8,770 records and 19 columns.

This second project involves evaluating data from the Washington Post data repository, with a particular emphasis on fatal police shootings in the United States. I started with simple operations like “describe()” and “info()” to get a sense of the data and its features. The data has some missing values, and the total number of shootings is 8,770, spanning 2015 to 2023 (the current date). The data covers 51 states, 3,374 cities, and 3,417 police departments. That is as far as my analysis went today.
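A sketch of this first pass; the filename is an assumed local copy of the Washington Post CSV:

```python
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # assumed local copy

print(df.shape)       # (8770, 19) at the time of this entry
df.info()             # column names, dtypes, and non-null counts
print(df.describe())  # summary statistics for the numeric columns

# distinct states and cities represented in the data
print(df["state"].nunique(), df["city"].nunique())
```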

11th October 2023

In today’s class, we started analyzing the data for Project 2 and discussed some queries regarding the data. The Washington Post maintains a database of fatal police shootings in the United States. This data set contains records dating back to January 2, 2015, and is updated once a week. I performed a rudimentary examination and discovered that there are numerous missing values.

As I initiated a preliminary analysis, a noticeable challenge surfaced: the presence of numerous missing values, particularly in critical fields like ‘flee,’ ‘age,’ and ‘race.’ This gap in the data poses an important question – how do we effectively address these gaps and proceed with the project?
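A quick sketch of how the missingness can be quantified, using the same assumed local copy of the data:

```python
import pandas as pd

df = pd.read_csv("fatal-police-shootings-data.csv")  # assumed local copy

# count and fraction of missing values in each column
missing = df.isna().sum().sort_values(ascending=False)
print(missing[missing > 0])
print((missing / len(df)).round(3)[missing > 0])  # fraction missing
```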

In the coming days, I will do a thorough study to determine any relationships between the features in order to obtain a more complete picture.

6th October 2023

Today I completed the entire coding part of my project and also started working on the report. We got an R^2 value of 0.3407, which means the model explains only about 34% of the variance; if we want to predict more than that, we have to consider additional aspects to build a better model. We also drew a graph showing the relationship between diabetes and both inactivity and obesity; in that graph, the obesity and inactivity points fall in roughly the same region of the plot.
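A sketch of roughly how such a model and its R^2 could be computed with scikit-learn; the filename and column names are hypothetical placeholders for the CDC-style data used in the project:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

data = pd.read_csv("cdc_diabetes.csv")  # hypothetical filename
X = data[["obesity", "inactivity"]]     # hypothetical column names
y = data["diabetes"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2:", r2_score(y_test, model.predict(X_test)))  # ~0.341 in our run
```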

4th October 2023

During today’s session, we primarily centered our attention on our project. We allocated a substantial amount of time to tackling queries and issues associated with it. This conversation enabled us to resolve any doubts and ensure that everyone had a clear understanding of the project’s Issues, Findings, and Methods A, B, and C.

We initiated the actual implementation phase by beginning to write the project report. Today we completed the Issues and Findings sections: under Issues, we describe the basis on which we are analyzing the data and where the data comes from; under Findings, we report the values we obtained while analyzing whether the data is relevant for predicting diabetes.

2nd October 2023

In today’s class, we discussed the format of the report and what concepts we needed to include in the report.

Regarding the project:

We have done the visualization part. The visualization shows the relationship between the independent variables and the dependent variable for the test data. The independent variables are obesity and inactivity, and the dependent variable is diabetes. We have drawn plots of obesity vs. diabetes and inactivity vs. diabetes. I also got the R^2 value and the mean squared error value. The R^2 value is 0.395; this represents the proportion of the variance in the dependent variable that is explained by the independent variables in the model. R^2 ranges from 0 to 1, with higher values indicating a better fit. The reported MSE value is -0.400063; since a true MSE cannot be negative, this is most likely the negated MSE returned by scikit-learn’s ‘neg_mean_squared_error’ scoring, so the actual MSE is about 0.400. The MSE represents the average of the squared errors between the predicted and actual values; lower values are better, but the scale depends on the dependent variable.
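A sketch of that likely explanation for the negative value: cross_val_score negates the MSE when scoring with ‘neg_mean_squared_error’ so that higher is always better. The filename and column names are again hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

data = pd.read_csv("cdc_diabetes.csv")  # hypothetical filename
X = data[["obesity", "inactivity"]]     # hypothetical column names
y = data["diabetes"]

# 'neg_mean_squared_error' returns the *negated* MSE, so the scores are
# negative by construction; flip the sign to recover the actual MSE.
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print("negated MSE per fold:", scores)
print("MSE:", -scores.mean())
```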