My initial steps in working with the two CSV files, ‘fatal-police-shootings-data’ and ‘fatal-police-shootings-agencies’, involved loading them into a Jupyter Notebook. Here’s a summary of the steps and challenges I encountered, with short code sketches for each step after the list:
- Loading Data: I began by reading the two CSV files into pandas dataframes. The ‘fatal-police-shootings-data’ dataframe contains 8,770 instances and 19 features, while the ‘fatal-police-shootings-agencies’ dataframe has 3,322 instances and 5 features.
- Data Column Alignment: After examining the column descriptions on GitHub, I realized that the ‘ids’ column in the ‘fatal-police-shootings-agencies’ dataframe is equivalent to the ‘agency_ids’ column in the ‘fatal-police-shootings-data’ dataframe. Therefore, I renamed the column from ‘ids’ to ‘agency_ids’ in the ‘fatal-police-shootings-agencies’ dataframe to facilitate merging.
- Data Type Mismatch: When I attempted to merge the two dataframes on ‘agency_ids’, I encountered an error indicating that the merge key had different data types in the two dataframes. Inspecting the data types with ‘.info()’ showed that one dataframe stored ‘agency_ids’ as an object type while the other stored it as int64. To address this, I used ‘pd.to_numeric()’ to ensure that both columns were of type ‘int64’.
- Data Splitting: I then ran into a new challenge in the ‘fatal-police-shootings-data’ dataframe: the ‘agency_ids’ column contained multiple IDs in a single cell. To proceed, I am in the process of splitting these cells into multiple rows.
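A minimal loading sketch, assuming the two CSVs sit in the notebook’s working directory; the file paths and the ‘shootings’/‘agencies’ variable names are placeholders of my own:

```python
import pandas as pd

# File names are assumed; adjust the paths to wherever the CSVs were downloaded.
shootings = pd.read_csv("fatal-police-shootings-data.csv")
agencies = pd.read_csv("fatal-police-shootings-agencies.csv")

print(shootings.shape)  # expected: (8770, 19)
print(agencies.shape)   # expected: (3322, 5)
shootings.info()        # column names, dtypes, and non-null counts
```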
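The rename and data-type alignment might look roughly like this; the column names (‘ids’, ‘agency_ids’) follow the descriptions above, and the rest is a sketch rather than the exact code I ran:

```python
# Give the agencies key the same name as the key in the shootings dataframe
agencies = agencies.rename(columns={"ids": "agency_ids"})

# Align the dtype of the merge key on the agencies side
agencies["agency_ids"] = pd.to_numeric(agencies["agency_ids"]).astype("int64")

# Note: applying pd.to_numeric() to shootings["agency_ids"] raises a ValueError
# for any cell that holds several IDs, which is exactly what the next step fixes.
```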
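For the splitting step, one common pandas pattern is str.split() followed by explode(); the semicolon separator is an assumption that should be verified against the raw values:

```python
# Split each cell into a list of IDs; the ";" separator is an assumption,
# so check the raw data first, e.g. shootings["agency_ids"].str.contains(";").sum()
shootings["agency_ids"] = shootings["agency_ids"].astype(str).str.split(";")

# explode() turns every list element into its own row, repeating the other columns
shootings = shootings.explode("agency_ids")

# With one ID per cell, the key can be cast to int64 (this assumes every record
# lists at least one agency ID; otherwise handle missing values first)
shootings["agency_ids"] = pd.to_numeric(shootings["agency_ids"].str.strip()).astype("int64")

# A left merge keeps every shooting record and attaches agency details where available
merged = shootings.merge(agencies, on="agency_ids", how="left")
```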
Once I successfully split the cells in the ‘fatal-police-shootings-data’ dataframe into multiple rows, I plan to delve deeper into data exploration and begin data preprocessing. This will involve tasks such as cleaning, handling missing data, and preparing the data for analysis or modeling.
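As a first pass on that preprocessing plan, a quick missing-value summary on the merged dataframe could be as simple as:

```python
# Count missing values per column and show only the columns that have any
missing = merged.isna().sum().sort_values(ascending=False)
print(missing[missing > 0])

# Percentage of rows missing each field, useful when deciding whether to drop or impute
print((merged.isna().mean() * 100).round(1))
```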