Data Science Knowledge Base
From MattWiki
Revision as of 14 December 2023, 15:39 (edited by Matt)
This article is a knowledge base covering the basics of starting a data science project.
Sources:
- openHPI Data Science Bootcamp: https://open.hpi.de/courses/datascience2023
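- NumPy and pandas tutorials and reference:
  - https://www.w3schools.com/python/numpy/default.asp
  - https://www.w3schools.com/python/pandas/default.asp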
Exploratory Data Analysis
# Load numpy and pandas libraries
import numpy as np
import pandas as pd
# Read data from CSV file into a dataframe
df = pd.read_csv('911.csv')
# Show information about the columns: number of non-null values and data type
# (df.info() prints its report itself and returns None, so no print() is needed)
df.info()
# Show first and last rows and columns of the dataframe
print(df)
# Show first 10 rows of dataframe
print(df.head(10))
# Describe numerical columns of dataframe by showing their count, mean, min, max and other summary statistics:
print(df.describe())
# Analyze columns of interest, i.e. ZIP code, title and timeStamp:
# Note: ZIP codes are labels rather than quantities, so their mean is not a meaningful statistic
print(df["zip"].mean())
print(df["zip"].value_counts().head(10))
print(df["zip"].value_counts().tail(10))
print(df["zip"].nunique())
print(df["title"].nunique())
print(df["timeStamp"].min())
print(df["timeStamp"].max())
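Without further options, read_csv loads a timestamp column as plain text, so min() and max() compare strings. A minimal sketch of converting such a column to real datetime values, assuming the values are in a format that pd.to_datetime can parse; the sample rows below are made up for illustration and are not taken from 911.csv:

```python
import pandas as pd

# Hypothetical sample rows standing in for the timeStamp column (not real data)
sample = pd.DataFrame({
    "timeStamp": [
        "2016-01-05 09:30:00",
        "2015-12-10 17:40:00",
        "2016-03-01 08:15:00",
    ]
})

# Convert the text column to datetime values
sample["timeStamp"] = pd.to_datetime(sample["timeStamp"])

first_call = sample["timeStamp"].min()
last_call = sample["timeStamp"].max()
time_span = last_call - first_call  # a Timedelta covering the observed period

print(first_call, last_call, time_span)
```

After the conversion, the difference between the latest and earliest timestamp gives the period the dataset covers, which is a useful figure for the summary.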
Finish the exploratory data analysis by writing a management summary of what you learned about the dataset.
Handling Missing Data with Pandas
https://pandas.pydata.org/docs/user_guide/missing_data.html
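The linked guide covers the details. As a rough sketch of the most common steps (counting, dropping and filling missing values), the following may help; the dataframe, its column names and the placeholder values are hypothetical and only serve as an illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; names and values are illustrative only
df_demo = pd.DataFrame({
    "zip": [19401.0, np.nan, 19446.0, np.nan],
    "title": ["EMS", "Fire", None, "Traffic"],
})

# Count missing values per column
missing_per_column = df_demo.isna().sum()

# Option 1: drop every row that has at least one missing value
rows_without_gaps = df_demo.dropna()

# Option 2: keep all rows and fill the gaps with a placeholder per column
filled = df_demo.fillna({"zip": 0, "title": "unknown"})

print(missing_per_column)
print(rows_without_gaps)
print(filled)
```

Which option fits depends on the analysis: dropping rows loses data, while filling with a placeholder can distort metrics such as the mean.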
Location and Dispersion Metrics
Location metrics:
df["*nameOfAColumn*"].mode()
df["*nameOfAColumn*"].mean()
df["*nameOfAColumn*"].median()
Dispersion metrics:
df["length_in_minutes"].std()
df["length_in_minutes"].quantile(0.75)
We can also calculate the interquartile range (IQR), the difference between the 75th and the 25th percentile, as shown in the video:
print(df["length_in_minutes"].quantile(0.75) - df["length_in_minutes"].quantile(0.25))
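To see how these metrics behave, here is a small worked sketch. The five values for length_in_minutes are invented for illustration; only the method calls come from the notes above:

```python
import pandas as pd

# Hypothetical values for the length_in_minutes column
df_demo = pd.DataFrame({"length_in_minutes": [90, 100, 100, 110, 150]})
col = df_demo["length_in_minutes"]

# Location metrics
mode_values = col.mode()    # a Series, because there can be several modes
mean_value = col.mean()
median_value = col.median()

# Dispersion metrics
std_value = col.std()       # sample standard deviation (ddof=1 by default)
q1 = col.quantile(0.25)
q3 = col.quantile(0.75)
iqr = q3 - q1               # interquartile range

print(mode_values.tolist(), mean_value, median_value)
print(std_value, q1, q3, iqr)
```

In this sample the single large value 150 pulls the mean up to 110, while the median stays at 100. The same effect shows in the dispersion metrics: the standard deviation reacts strongly to the outlier, whereas the IQR only looks at the middle half of the data.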