Data Science Knowledge Base
From MattWiki
This article collects the basics needed to start a data science project.
Sources:
- openHPI Data Science Bootcamp: https://open.hpi.de/courses/datascience2023
- NumPy and pandas tutorials and reference documentation
Exploratory Data Analysis
# Load numpy and pandas libraries
import numpy as np
import pandas as pd
# Read data from CSV file into a dataframe
df = pd.read_csv('911.csv')
# Show information about the columns: names, non-null counts and data types
# (info() prints its report itself and returns None, so no print() is needed)
df.info()
# Show the dataframe; for large frames pandas prints only the first and last rows and columns
print(df)
# Show the first 10 rows of the dataframe
print(df.head(10))
# Summarize the numerical columns: count, mean, standard deviation, min, quartiles and max
print(df.describe())
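# Analyze columns of interest, here zip, title and timeStamp:
# Note: zip is an identifier rather than a quantity, so its mean is only a sanity check;
# value_counts() and nunique() are the more informative calls for this column
print(df["zip"].mean())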
print(df["zip"].value_counts().head(10))
print(df["zip"].value_counts().tail(10))
print(df["zip"].nunique())
print(df["title"].nunique())
print(df["timeStamp"].min())
print(df["timeStamp"].max())
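Because read_csv loads timeStamp as text unless told otherwise, min() and max() above compare strings, which matches chronological order only for ISO-style formats. A minimal sketch of converting the column with pd.to_datetime first, using three made-up sample rows instead of the real 911.csv:

```python
import pandas as pd

# Hypothetical sample rows standing in for the timeStamp column of 911.csv
sample = pd.DataFrame({
    "timeStamp": [
        "2016-01-10 17:40:00",
        "2015-12-10 17:10:52",
        "2016-03-02 08:15:30",
    ],
})

# Parse the text into real datetime values
sample["timeStamp"] = pd.to_datetime(sample["timeStamp"])

# min() and max() now compare points in time, not strings
first_call = sample["timeStamp"].min()
last_call = sample["timeStamp"].max()
print(first_call, last_call)

# Subtracting two timestamps gives the covered time span as a Timedelta
print(last_call - first_call)
```

With datetime values the covered period of the dataset can be stated directly in the management summary.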
Finish the exploratory data analysis by writing a management summary containing gained knowledge about the dataset.
Handling Missing Data with Pandas
https://pandas.pydata.org/docs/user_guide/missing_data.html
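As a starting point, the following sketch shows a few common pandas calls for finding and treating missing values. The dataframe and its values are made up for illustration; only the column names zip and title are borrowed from the 911 example above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps (values are invented for illustration)
data = pd.DataFrame({
    "zip": [19401.0, np.nan, 19446.0, np.nan],
    "title": ["EMS: FALL VICTIM", "Fire: GAS LEAK", None, "Traffic: ACCIDENT"],
})

# Count the missing values per column
missing_per_column = data.isna().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value
complete_rows = data.dropna()

# Option 2: drop rows only if a specific column is missing
zip_known = data.dropna(subset=["zip"])

# Option 3: replace missing values with a placeholder, per column
filled = data.fillna({"title": "UNKNOWN"})
```

Which option fits depends on the analysis: dropping rows loses data, while filling in placeholders can distort counts and averages.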
Location and Dispersion Metrics
Location metrics:
df["*nameOfAColumn*"].mode()
df["*nameOfAColumn*"].mean()
df["*nameOfAColumn*"].median()
Dispersion metrics:
df["length_in_minutes"].std()
df["length_in_minutes"].quantile(0.75)
The interquartile range (IQR) is the difference between the 75th and the 25th percentile and can be calculated as follows:
print(df["length_in_minutes"].quantile(0.75) - df["length_in_minutes"].quantile(0.25))
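The metrics of this section can be tried out on a small example. The sketch below uses a made-up series of durations; only the column name length_in_minutes comes from the snippets above.

```python
import pandas as pd

# Hypothetical durations in minutes, with one unusually long entry
lengths = pd.Series([90, 95, 100, 100, 110, 120, 180], name="length_in_minutes")

# Location metrics
modes = lengths.mode()        # most frequent value(s); returned as a Series
mean = lengths.mean()
median = lengths.median()
print(modes.tolist(), mean, median)

# Dispersion metrics
std = lengths.std()           # sample standard deviation (ddof=1 by default)
q1 = lengths.quantile(0.25)   # 25th percentile
q3 = lengths.quantile(0.75)   # 75th percentile
iqr = q3 - q1                 # interquartile range
print(std, q1, q3, iqr)
```

In this example the single long entry pulls the mean above the median. Median and IQR are less sensitive to such outliers than mean and standard deviation.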