
Revision as of 14 December 2023, 18:43

This article is a knowledge base covering the basics of starting a data science project.

Sources:

openHPI Data Science Bootcamp: https://open.hpi.de/courses/datascience2023
NumPy and pandas tutorials and reference:
- https://www.w3schools.com/python/numpy/default.asp
- https://www.w3schools.com/python/pandas/default.asp

Exploratory Data Analysis

See Jupyter Notebook: EDA-basic-recipe.ipynb

# Load numpy and pandas libraries
import numpy as np
import pandas as pd

# Read data from CSV file into a dataframe
df = pd.read_csv('911.csv')                

# Show information about the columns: non-null counts and data types
# (df.info() prints its report itself and returns None, so no print() is needed)
df.info()

# Show first and last rows and columns of the dataframe
print(df)

# Show first 10 rows of dataframe
print(df.head(10))

# Summarize the numerical columns of the dataframe: count, mean, std, min, max and quartiles
print(df.describe())

# Analyze columns of interest, i.e. ZIP code, title and timeStamp:
print(df["zip"].mean())
print(df["zip"].value_counts().head(10))
print(df["zip"].value_counts().tail(10))
print(df["zip"].nunique())
print(df["title"].nunique())
print(df["timeStamp"].min())
print(df["timeStamp"].max())

Finish the exploratory data analysis by writing a management summary of what you learned about the dataset.
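The timeStamp analysis above can be taken a step further by parsing the column as datetime values. The following is a minimal sketch, not the notebook's code: it assumes timestamps in the form YYYY-MM-DD HH:MM:SS and uses a small made-up dataframe (the name sample and all values are hypothetical) in place of 911.csv.

```python
import pandas as pd

# Hypothetical sample rows standing in for 911.csv (all values are made up)
sample = pd.DataFrame({
    "zip": [19401.0, 19401.0, 19446.0, None],
    "title": ["EMS: FALL VICTIM", "Fire: GAS-ODOR/LEAK",
              "EMS: FALL VICTIM", "Traffic: VEHICLE ACCIDENT"],
    "timeStamp": ["2015-12-10 17:40:00", "2016-01-03 08:15:00",
                  "2015-12-11 09:05:00", "2016-02-20 23:59:00"],
})

# Convert the text column to real datetime values
sample["timeStamp"] = pd.to_datetime(sample["timeStamp"])

# Time span covered by the data, as a Timedelta
time_span = sample["timeStamp"].max() - sample["timeStamp"].min()
print(time_span)

# Number of calls per calendar month
calls_per_month = sample["timeStamp"].dt.to_period("M").value_counts().sort_index()
print(calls_per_month)

# ZIP codes are labels rather than quantities, so the most frequent
# value is usually more informative than the mean
most_common_zip = sample["zip"].mode().iloc[0]
print(most_common_zip)
```

Figures such as the covered time span and the most frequent ZIP code are the kind of findings that fit into the management summary.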

Handling Missing Data with Pandas

See the pandas user guide on missing data: https://pandas.pydata.org/docs/user_guide/missing_data.html
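As a rough illustration of what the user guide covers, the sketch below counts, drops and fills missing values. The dataframe df_missing and its values are made up for this example, and the placeholder values used for filling are arbitrary assumptions, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps (all values are made up)
df_missing = pd.DataFrame({
    "zip": [19401.0, np.nan, 19446.0, np.nan],
    "title": ["EMS", "Fire", None, "Traffic"],
})

# Count missing values per column
missing_per_column = df_missing.isna().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value
dropped = df_missing.dropna()
print(dropped)

# Option 2: fill the gaps with a placeholder chosen per column
filled = df_missing.fillna({"zip": 0, "title": "unknown"})
print(filled)
```

Dropping rows is simple but discards data; filling keeps all rows but the placeholder has to be chosen so that it does not distort later statistics.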

Location and Dispersion Metrics

Location metrics:

df["*nameOfAColumn*"].mode()
df["*nameOfAColumn*"].mean()
df["*nameOfAColumn*"].median()

Dispersion metrics:

df["length_in_minutes"].std()
df["length_in_minutes"].quantile(0.75)

We can also calculate the interquartile range (IQR), the difference between the 75th and the 25th percentile, as shown in the video:

print(df["length_in_minutes"].quantile(0.75) - df["length_in_minutes"].quantile(0.25))
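To make the metrics above concrete, here is a small worked example. The series lengths and its values are made up for illustration; only the column name length_in_minutes is taken from the text.

```python
import pandas as pd

# Hypothetical durations in minutes (made-up values, one unusually long entry)
lengths = pd.Series([90, 95, 100, 100, 110, 120, 240], name="length_in_minutes")

# Location metrics
print(lengths.mode().tolist())   # most frequent value(s)
print(lengths.mean())            # arithmetic mean
print(lengths.median())          # middle value of the sorted data

# Dispersion metrics
print(lengths.std())             # sample standard deviation
q1 = lengths.quantile(0.25)      # 25th percentile
q3 = lengths.quantile(0.75)      # 75th percentile
iqr = q3 - q1                    # interquartile range
print(iqr)
```

In this sample the single long entry pulls the mean (about 122) well above the median (100), while the IQR (17.5) is unaffected by it. This is why median and IQR are often reported alongside mean and standard deviation.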

Bivariate Analysis

See Jupyter Notebook: Bivariate_Analsis.ipynb

Introduction

Bivariate analysis can be used for:

Exploring the relationship between two variables
Comparing groups
Testing hypotheses
Predicting outcomes

Correlation between two variables is a statistical measure of the strength and direction of the linear relationship between them:

Perfect positive correlation = +1
Perfect negative correlation = -1
No correlation = 0
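The three reference values can be reproduced with the pandas method Series.corr, which computes the Pearson correlation coefficient by default. This is a minimal sketch with made-up numbers chosen so that the results hit the extremes.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

# y rises in exact proportion to x: perfect positive correlation
r_positive = x.corr(2 * x + 1)

# y falls in exact proportion to x: perfect negative correlation
r_negative = x.corr(-3 * x + 10)

# Symmetric V-shaped pattern without a linear trend: no linear correlation
r_none = x.corr(pd.Series([2, 1, 0, 1, 2]))

print(r_positive, r_negative, r_none)
```

The third case shows that a coefficient of 0 only rules out a linear relationship: the V-shaped values clearly depend on x, but not linearly.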

Correlation with Scatter Plot

A scatter plot can be used to visualize the relationship between two continuous variables.
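A minimal sketch of such a plot with matplotlib follows. The variables height_cm and weight_kg are synthetic data generated for this example, and the output file name scatter.png is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: weight depends roughly linearly on height, plus random noise
rng = np.random.default_rng(seed=42)
height_cm = rng.normal(loc=170, scale=10, size=100)
weight_kg = 0.9 * height_cm - 85 + rng.normal(loc=0, scale=5, size=100)

# One point per observation: x position = height, y position = weight
fig, ax = plt.subplots()
ax.scatter(height_cm, weight_kg, alpha=0.7)
ax.set_xlabel("height_cm")
ax.set_ylabel("weight_kg")
ax.set_title("Scatter plot of two continuous variables")
fig.savefig("scatter.png")
```

A point cloud that rises from lower left to upper right, as in this synthetic data, indicates a positive correlation. In a Jupyter notebook the backend line can be dropped and plt.show() used instead of saving to a file.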

Correlation using Pandas and Seaborn

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")

print(iris.sample(10))

sns.pairplot(iris)
plt.show()
sns.pairplot(iris, hue="species", diag_kind='hist')
plt.show()
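The pair plots show the relationships visually; the coefficients themselves can be computed with the pandas method DataFrame.corr. The sketch below uses a small made-up table (the name flowers and all values are hypothetical, with columns modeled on the iris dataset) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical measurements (made-up values, columns modeled on the iris dataset)
flowers = pd.DataFrame({
    "petal_length": [1.4, 1.5, 4.5, 4.7, 5.9, 6.1],
    "petal_width": [0.2, 0.2, 1.5, 1.4, 2.1, 2.3],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Keep the numeric columns only, then compute pairwise Pearson correlations
corr_matrix = flowers.select_dtypes("number").corr()
print(corr_matrix.round(2))
```

With seaborn, the resulting matrix can be drawn as a heatmap via sns.heatmap(corr_matrix, annot=True), which is often easier to read than the raw table when there are many columns.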

