Data Science Knowledge Base
From MattWiki
Revision as of 21:19, 14 December 2023, by Matt
This article is a knowledge base covering the basics of starting a data science project.
Sources:
- openHPI Data Science Bootcamp: https://open.hpi.de/courses/datascience2023
- NumPy and pandas tutorials and reference, e.g. https://www.w3schools.com/python/pandas/default.asp
Exploratory Data Analysis
See Jupyter Notebook: EDA-basic-recipe.ipynb
# Load numpy and pandas libraries
import numpy as np
import pandas as pd
# Read data from CSV file into a dataframe
df = pd.read_csv('911.csv')
# Show information about the columns: names, non-null counts and data types
print(df.info())
# Show a truncated view of the dataframe (first and last rows and columns)
print(df)
# Show the first 10 rows of the dataframe
print(df.head(10))
# Describe the numerical columns: count, mean, std, min, quartiles and max
print(df.describe())
# Analyze columns of interest, i.e. ZIP code, title and timeStamp:
print(df["zip"].mean())
print(df["zip"].value_counts().head(10))
print(df["zip"].value_counts().tail(10))
print(df["zip"].nunique())
print(df["title"].nunique())
print(df["timeStamp"].min())
print(df["timeStamp"].max())
Finish the exploratory data analysis by writing a management summary of the insights gained about the dataset.
Handling Missing Data with Pandas
https://pandas.pydata.org/docs/user_guide/missing_data.html
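As a minimal sketch of the techniques described on that page, the following example counts, drops and fills missing values. The dataframe `df_missing`, its column names and its values are illustrative assumptions and are not taken from the 911 dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; NaN and None mark missing entries
df_missing = pd.DataFrame({
    "zip": [19401.0, np.nan, 19446.0, np.nan],
    "title": ["EMS", "Fire", None, "Traffic"],
})

# Count missing values per column
missing_per_column = df_missing.isna().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
rows_complete = df_missing.dropna()
print(rows_complete)

# Alternatively, fill missing values with a per-column default
filled = df_missing.fillna({
    "zip": df_missing["zip"].median(),
    "title": "unknown",
})
print(filled)
```

Whether dropping or filling is appropriate depends on the dataset; filling a ZIP code with the median is shown here only to demonstrate the call.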
Location and Dispersion Metrics
Location metrics:
df["*nameOfAColumn*"].mode()
df["*nameOfAColumn*"].mean()
df["*nameOfAColumn*"].median()
Dispersion metrics:
df["length_in_minutes"].std()
df["length_in_minutes"].quantile(0.75)
We can also calculate the interquartile range (IQR), the difference between the 75th and the 25th percentile, as shown in the video:
print(df["length_in_minutes"].quantile(0.75) - df["length_in_minutes"].quantile(0.25))
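The metrics above can be tried on a small worked example. This is only a sketch: the dataframe `df_example` and its five duration values are made up for illustration, and only the column name `length_in_minutes` is taken from the text.

```python
import pandas as pd

# Hypothetical durations in minutes (illustrative values only)
df_example = pd.DataFrame({"length_in_minutes": [90, 100, 100, 120, 150]})
col = df_example["length_in_minutes"]

# Location metrics
print(col.mode().tolist())  # most frequent value(s): [100]
print(col.mean())           # arithmetic mean: 112.0
print(col.median())         # middle value: 100.0

# Dispersion metrics
print(col.std())            # sample standard deviation (ddof=1)
q1 = col.quantile(0.25)     # 25th percentile: 100.0
q3 = col.quantile(0.75)     # 75th percentile: 120.0
iqr = q3 - q1
print(iqr)                  # interquartile range: 20.0
```

Note that `mode()` returns a Series because several values can be equally frequent.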
Bivariate Analysis
See Jupyter Notebook: Bivariate_Analsis.ipynb
Introduction
Bivariate analysis can be used for:
- Exploring the relationship between two variables
- Comparing groups
- Testing hypotheses
- Predicting outcomes
Correlation between two variables is a statistical measure of the strength and direction of the relationship between them:
- Perfect positive correlation = +1
- Perfect negative correlation = -1
- No correlation = 0
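The three cases can be reproduced with a small sketch. It assumes the Pearson correlation coefficient, which is the default of pandas' `Series.corr`; the dataframe `toy` and its columns are hypothetical and chosen so that the results are easy to check.

```python
import pandas as pd

# Hypothetical toy data: y_up rises with x, y_down falls with x,
# y_none has no linear relationship with x
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y_up": [2, 4, 6, 8, 10],
    "y_down": [10, 8, 6, 4, 2],
    "y_none": [1, -1, 0, -1, 1],
})

# Series.corr computes the Pearson correlation coefficient by default
r_positive = toy["x"].corr(toy["y_up"])
r_negative = toy["x"].corr(toy["y_down"])
r_none = toy["x"].corr(toy["y_none"])
print(r_positive, r_negative, r_none)
```

The values come out as approximately +1, -1 and 0. A coefficient of 0 only rules out a linear relationship; `y_none` still follows a symmetric pattern around the middle of `x`.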
Correlation with Scatter Plot
A scatter plot can be used to visualize the relationship between two continuous variables.
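A minimal sketch of such a plot with matplotlib follows. The dataframe `data`, its column names and its values are assumptions made up for illustration, and the output file name `scatter_example.png` is hypothetical as well.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical measurements (illustrative values only)
data = pd.DataFrame({
    "height_cm": [150, 160, 165, 172, 180, 188],
    "weight_kg": [52, 58, 63, 70, 79, 90],
})

# One point per row: x position from the first column, y position from the second
fig, ax = plt.subplots()
ax.scatter(data["height_cm"], data["weight_kg"])
ax.set_xlabel("height_cm")
ax.set_ylabel("weight_kg")
ax.set_title("Scatter plot of two continuous variables")

# Write the figure to a file; in a notebook the figure is displayed inline instead
fig.savefig("scatter_example.png")
```

Points that rise from the lower left to the upper right, as in this made-up data, suggest a positive correlation.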
Correlation using Pandas and Seaborn
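import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the iris example dataset via seaborn (requires an internet connection on first use)
iris = sns.load_dataset("iris")
# Inspect 10 randomly chosen rows
print(iris.sample(10))
# Pair plot: one scatter plot per pair of numeric columns
sns.pairplot(iris)
plt.show()
# Same pair plot, colored by species, with histograms on the diagonal
sns.pairplot(iris, hue="species", diag_kind='hist')
plt.show()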
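The pair plots show the relationships visually. The correlation coefficients themselves could be computed with pandas, as in the following sketch. To run offline it uses a small hand-made dataframe `flowers` as a stand-in; its column names mimic the iris dataset, but the frame and the selection of rows are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in so the sketch runs offline;
# with the real data use: iris = sns.load_dataset("iris")
flowers = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations
corr_matrix = flowers.select_dtypes("number").corr()
print(corr_matrix)
```

The result is a symmetric matrix with 1.0 on the diagonal. Passing it to `sns.heatmap(corr_matrix, annot=True)` is one common way to visualize it.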