
Revision as of 14 December 2023, 16:39

This article is a knowledge base covering the basics of starting a data science project.

Sources:

openHPI Data Science Bootcamp: https://open.hpi.de/courses/datascience2023
NumPy and pandas tutorials and reference:
- https://www.w3schools.com/python/numpy/default.asp
- https://www.w3schools.com/python/pandas/default.asp

Exploratory Data Analysis

# Load numpy and pandas libraries
import numpy as np
import pandas as pd

# Read data from CSV file into a dataframe
df = pd.read_csv('911.csv')

# Show information about the columns: non-null counts and data types
# (info() prints its report itself and returns None, so no print() is needed)
df.info()

# Show first and last rows and columns of the dataframe
print(df)

# Show the first 10 rows of the dataframe
print(df.head(10))

# Describe the numerical columns of the dataframe: count, mean, standard deviation, min, quartiles and max
print(df.describe())

# Analyze columns of interest, i.e. ZIP code, title and timeStamp
# Note: a ZIP code is a label rather than a quantity, so its mean has no real-world meaning
print(df["zip"].mean())
print(df["zip"].value_counts().head(10))
print(df["zip"].value_counts().tail(10))
print(df["zip"].nunique())
print(df["title"].nunique())
print(df["timeStamp"].min())
print(df["timeStamp"].max())
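
If the timeStamp column was read as plain strings, min() and max() compare text rather than dates. The following is a minimal sketch of parsing the column first. It assumes a small made-up sample in place of 911.csv and a `YYYY-MM-DD HH:MM:SS` layout; both are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample standing in for the timeStamp column of 911.csv
sample = pd.DataFrame({
    "timeStamp": [
        "2015-12-10 17:40:00",
        "2016-08-24 11:06:00",
        "2016-01-03 09:15:00",
    ]
})

# Convert the strings to real datetime values
sample["timeStamp"] = pd.to_datetime(sample["timeStamp"])

first_call = sample["timeStamp"].min()
last_call = sample["timeStamp"].max()

# Subtracting two timestamps yields a Timedelta covering the observed period
print(first_call, last_call, last_call - first_call)
```

Once the column holds datetime values, the observed period can be stated directly in the management summary.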

Finish the exploratory data analysis by writing a management summary of what you learned about the dataset.

Handling Missing Data with Pandas

https://pandas.pydata.org/docs/user_guide/missing_data.html
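
As a rough illustration of what the linked guide covers, the sketch below counts missing values and shows two common ways of dealing with them. The sample data and the placeholder values (`-1`, `"unknown"`) are hypothetical; which strategy is appropriate depends on the dataset and the question being asked.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps; the column names mirror the 911 example
sample = pd.DataFrame({
    "zip": [19401.0, np.nan, 19446.0, np.nan],
    "title": ["EMS", "Fire", None, "Traffic"],
})

# Count missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value
complete_rows = sample.dropna()
print(len(complete_rows))

# Option 2: keep all rows and fill the gaps with a placeholder per column
filled = sample.fillna({"zip": -1, "title": "unknown"})
print(filled)
```

Dropping rows keeps only fully observed records but shrinks the dataset, while filling keeps every row at the cost of introducing artificial values.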

Location and Dispersion Metrics

Location metrics:

df["*nameOfAColumn*"].mode()
df["*nameOfAColumn*"].mean()
df["*nameOfAColumn*"].median()
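
To make the three location metrics concrete, here is a small sketch on made-up values. The column name `length_in_minutes` is borrowed from the dispersion examples, and the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical values with one unusually large entry
sample = pd.DataFrame({"length_in_minutes": [90, 95, 95, 100, 180]})
col = sample["length_in_minutes"]

# mode() returns a Series because several values can tie for most frequent
print(col.mode().tolist())  # [95]
print(col.mean())           # 112.0
print(col.median())         # 95.0
```

The single large value pulls the mean well above the median, which is why it is worth looking at both.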

Dispersion metrics:

df["length_in_minutes"].std()
df["length_in_minutes"].quantile(0.75)

We can also calculate the interquartile range (IQR), the difference between the 75th and the 25th percentile, as shown in the video:

print(df["length_in_minutes"].quantile(0.75) - df["length_in_minutes"].quantile(0.25))
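
The IQR calculation can be tried on a small made-up sample as sketched below. The values are hypothetical, and the 1.5 × IQR outlier check is a common rule of thumb added here for illustration, not something taken from the course material.

```python
import pandas as pd

# Hypothetical values with one unusually large entry
sample = pd.DataFrame({"length_in_minutes": [90, 95, 95, 100, 180]})
col = sample["length_in_minutes"]

q1 = col.quantile(0.25)
q3 = col.quantile(0.75)
iqr = q3 - q1
print(iqr)  # 5.0

# Rule of thumb (an assumption, not from the source):
# flag values lying more than 1.5 * IQR outside the quartiles
outliers = col[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [180]
```

Unlike the standard deviation, the IQR is barely affected by the single extreme value, which makes it a robust measure of spread.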


Anonym

Suche

Data Science Knowledge Base: Unterschied zwischen den Versionen

Namensräume

Mehr

Seitenaktionen

Version vom 14. Dezember 2023, 16:39 Uhr

Exploratory Data Analysis

Handling Missing Data with Pandas

Location and Dispersion Metrics

Navigation

Navigation

SAP Development

Debian & Fedora Linux

Wikiwerkzeuge

Wikiwerkzeuge

Anonym

Suche

Data Science Knowledge Base: Unterschied zwischen den Versionen

Version vom 14. Dezember 2023, 16:39 Uhr

Exploratory Data Analysis

Handling Missing Data with Pandas

Location and Dispersion Metrics

Navigation

Wikiwerkzeuge

Seitenwerkzeuge

Kategorien