Data Science Knowledge Base: Difference between revisions

Revision as of 19 December 2023, 13:20

This article is a knowledge base covering the basics of starting a data science project.

Notable sources

openHPI Data Science Bootcamp: https://open.hpi.de/courses/datascience2023 → Go to this course for the Jupyter Notebooks mentioned below
Numpy and Pandas tutorials and reference
- https://www.w3schools.com/python/numpy/default.asp
- https://www.w3schools.com/python/pandas/default.asp
SciKit-learn: https://scikit-learn.org/stable/index.html#

Software Recommendations


Software Package	Download
Python	https://www.python.org/downloads/
pip	https://pip.pypa.io/en/stable/cli/pip_download/
Orange Data Mining	https://orangedatamining.com/download/ or pip install orange3

Essential libraries

pip install numpy pandas matplotlib seaborn scikit-learn

EDA / Exploratory Data Analysis

See openHPI Jupyter Notebook: EDA-basic-recipe.ipynb

# Load numpy and pandas libraries
import numpy as np
import pandas as pd

# Read data from CSV file into a dataframe
df = pd.read_csv('911.csv')                

# Show information about the columns: names, non-null counts and data types
# (df.info() prints directly and returns None, so it is not wrapped in print)
df.info()

# Show a truncated view of the dataframe (first and last rows and columns)
print(df)

# Show first 10 rows of dataframe
print(df.head(10))

# Describe numerical columns of dataframe by showing their count, mean, min, max and other statistics:
print(df.describe())

# Analyze columns of interest, e.g. ZIP code, title and timeStamp:
print(df["zip"].mean())
print(df["zip"].value_counts().head(10))
print(df["zip"].value_counts().tail(10))
print(df["zip"].nunique())
print(df["title"].nunique())
print(df["timeStamp"].min())
print(df["timeStamp"].max())

Finish the exploratory data analysis by writing a management summary of what was learned about the dataset.
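If timeStamp was read from the CSV as text, min() and max() in the recipe above work on strings. The following is a minimal sketch, assuming the column uses a YYYY-MM-DD HH:MM:SS layout (an assumption about 911.csv, not confirmed here). It converts the column to datetime so that the covered time span can be calculated. The three rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of 911.csv (values invented for illustration)
df = pd.DataFrame({
    "title": ["EMS: BACK PAINS/INJURY", "Fire: GAS-ODOR/LEAK", "EMS: BACK PAINS/INJURY"],
    "timeStamp": ["2015-12-10 17:40:00", "2016-01-03 08:15:00", "2015-12-11 09:05:00"],
})

# Convert the text column to real datetime values
df["timeStamp"] = pd.to_datetime(df["timeStamp"])

first_call = df["timeStamp"].min()
last_call = df["timeStamp"].max()
time_span = last_call - first_call  # a pandas Timedelta

print(first_call)
print(last_call)
print(time_span)
```

The time span is the kind of figure that fits well into the management summary.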

Handling Missing Data with Pandas

https://pandas.pydata.org/docs/user_guide/missing_data.html
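As a starting point before reading the guide, here is a minimal sketch of three common steps: counting gaps, dropping incomplete rows, and filling gaps with a placeholder. The dataframe and the placeholder values are invented for illustration. Which strategy is appropriate depends on the dataset.

```python
import numpy as np
import pandas as pd

# Small made-up dataframe with gaps (values are illustrative only)
df = pd.DataFrame({
    "zip": [19525.0, np.nan, 19401.0, np.nan],
    "title": ["EMS", "Fire", None, "Traffic"],
})

# Count missing values per column
missing_counts = df.isna().sum()
print(missing_counts)

# Option 1: drop every row that has at least one missing value
df_dropped = df.dropna()
print(df_dropped)

# Option 2: fill gaps with a placeholder chosen per column
df_filled = df.fillna({"zip": 0, "title": "unknown"})
print(df_filled)
```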

Location and Dispersion Metrics

Location metrics:

df["*nameOfAColumn*"].mode()
df["*nameOfAColumn*"].mean()
df["*nameOfAColumn*"].median()

Dispersion metrics:

df["length_in_minutes"].std()
df["length_in_minutes"].quantile(0.75)

We can also calculate the interquartile range (IQR), the difference between the 75th and the 25th percentile, as shown in the video:

print(df["length_in_minutes"].quantile(0.75) - df["length_in_minutes"].quantile(0.25))
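The metrics above can be tried out on a tiny dataset. This is a sketch with invented durations. Only the column name length_in_minutes is taken from the snippets above.

```python
import pandas as pd

# Made-up durations in minutes; 180 is a deliberate outlier
df = pd.DataFrame({"length_in_minutes": [90, 95, 100, 100, 110, 120, 180]})
col = df["length_in_minutes"]

# Location metrics
print(col.mode().tolist())  # most frequent value(s)
print(col.mean())
print(col.median())

# Dispersion metrics
print(col.std())  # sample standard deviation (pandas uses ddof=1)
iqr = col.quantile(0.75) - col.quantile(0.25)
print(iqr)
```

The single outlier pulls the mean above the median, while the median and the IQR are barely affected by it.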

Bivariate Analysis

See openHPI Jupyter Notebook: Bivariate_Analsis.ipynb

Introduction

Bivariate analysis can be used for:

Exploring the relationship between two variables
Comparing groups
Testing hypotheses
Predicting outcomes

Correlation between two variables is a statistical measure of the strength and direction of the relationship between them.

Perfect positive correlation = +1
Perfect negative correlation = -1
No correlation = 0
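The three cases can be reproduced with a few lines of NumPy. This sketch assumes that the Pearson correlation coefficient is meant. The helper pearson_r and the sample values are made up for illustration.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation: co-variation of a and b, scaled to the range -1..+1."""
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_positive = pearson_r(x, 2 * x + 1)     # y rises with x on a straight line
r_negative = pearson_r(x, -3 * x + 10)   # y falls with x on a straight line
r_none = pearson_r(x, np.array([2.0, 1.0, 3.0, 1.0, 2.0]))  # no linear trend

print(r_positive, r_negative, r_none)

# NumPy's built-in function returns the same coefficient
print(np.corrcoef(x, 2 * x + 1)[0, 1])
```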

Correlation with Scatter Plot

A scatter plot can be used to visualize the relationship between two continuous variables.

Correlation using Pandas and Seaborn

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")

print(iris.sample(10))

sns.pairplot(iris)
plt.show()
sns.pairplot(iris, hue="species", diag_kind='hist')
plt.show()
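The pair plots show the relationships visually. The matching numbers can be obtained with the pandas corr() method. This sketch loads the iris data from scikit-learn instead of seaborn so that it runs without a download. That choice is an assumption of this example, and the column names differ from seaborn's copy of the dataset.

```python
from sklearn.datasets import load_iris

# Iris as bundled with scikit-learn; as_frame=True returns pandas objects
iris = load_iris(as_frame=True).data

# Pairwise correlation of all numeric columns (Pearson by default)
corr_matrix = iris.corr()
print(corr_matrix.round(2))

# Look up a single pair of columns
petal_corr = corr_matrix.loc["petal length (cm)", "petal width (cm)"]
print(petal_corr)
```

With seaborn installed, sns.heatmap(corr_matrix, annot=True) would draw this matrix as a colored grid.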

Multivariate Analysis

See openHPI Jupyter Notebook: multivariate-analysis-video.ipynb

Multivariate analysis makes it possible to estimate how several individual parameters influence a selected target parameter, e.g.:

How much does the price of a car vary depending on:
- Age
- KM
- With or without power windows
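The car example can be sketched as a regression with several input columns. All data here is synthetic: the column names, the price formula and its coefficients are invented for illustration and are not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200

# Synthetic cars (hypothetical columns and value ranges)
cars = pd.DataFrame({
    "age_years": rng.uniform(0, 15, n),
    "km": rng.uniform(0, 250_000, n),
    "power_windows": rng.integers(0, 2, n),  # 0 = without, 1 = with
})

# Invented price formula plus random noise
price = (
    25_000
    - 900 * cars["age_years"]
    - 0.04 * cars["km"]
    + 500 * cars["power_windows"]
    + rng.normal(0, 300, n)
)

model = LinearRegression().fit(cars, price)

# One coefficient per input: estimated price change per unit of that input
effects = dict(zip(cars.columns, model.coef_))
print(effects)
```

Each coefficient estimates how much the price changes when one parameter changes by one unit while the others stay fixed.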

Linear Regression

Used in Machine Learning
Used to determine the relationship between an independent and a dependent variable, both of which are continuous

Simply speaking: linear regression fits a straight line to a dataset using the least squares method.

See openHPI Jupyter Notebook: Linear_Regression.ipynb

Documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Generate random data with positive correlation
x_val = np.random.rand(100)
y_val = x_val + np.random.random(100)*0.5

# Plot data in scatter plot (shown as its own figure before fitting)
plt.scatter(x_val, y_val)
plt.show()

# Reshape into 2D
x = x_val.reshape(-1,1)
y = y_val.reshape(-1,1)

# Create linear regression object
model = LinearRegression()

# Fit model to the data
model.fit(x,y)

# Generate predicted values of y
y_pred = model.predict(x)

# Plot data points and regression line
plt.scatter(x_val, y_val)
plt.plot(x_val, y_pred, color='green')
plt.xlabel('X Values')
plt.ylabel('Y Values')
plt.title('Linear Regression Example')
plt.show()
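To make the least squares step visible, the slope and intercept can also be calculated by hand and compared with scikit-learn. This is a sketch that uses the same kind of random data as above. The fixed seed is an addition so that the run is repeatable.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Random data with positive correlation (seed chosen arbitrarily)
rng = np.random.default_rng(0)
x_val = rng.random(100)
y_val = x_val + rng.random(100) * 0.5

# Closed-form least squares for a single input variable
x_mean, y_mean = x_val.mean(), y_val.mean()
slope = ((x_val - x_mean) * (y_val - y_mean)).sum() / ((x_val - x_mean) ** 2).sum()
intercept = y_mean - slope * x_mean

# The same fit with scikit-learn
x = x_val.reshape(-1, 1)
model = LinearRegression().fit(x, y_val)

print(slope, model.coef_[0])
print(intercept, model.intercept_)
print(model.score(x, y_val))  # R^2: share of variance explained by the line
```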

Decision Trees

See openHPI Jupyter Notebook: Decision_Trees.ipynb

Types of decision trees:

Classify things into categories → Classification Tree
Predict numeric values → Regression Tree
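Both tree types are available in scikit-learn. This is a minimal sketch on the iris data. The depth limit, the split ratio and the choice of petal width as numeric target are arbitrary assumptions of this example and are not taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Classification tree: predict the species (a category)
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
clf_accuracy = clf.score(X_test, y_test)  # share of correct predictions

# Regression tree: predict petal width (a number) from the other three columns
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X_train[:, :3], X_train[:, 3])
reg_r2 = reg.score(X_test[:, :3], X_test[:, 3])  # R^2 on unseen data

print(clf_accuracy, reg_r2)
```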

Logistic Regression

See openHPI Jupyter Notebook: Logistic_Regression_IRIS_video.ipynb
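For orientation, here is a minimal sketch of logistic regression on the iris data with scikit-learn. It is not the notebook's code. The split ratio, random_state and max_iter are assumptions chosen for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# max_iter raised so the solver has enough iterations to converge
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(accuracy)

# Unlike linear regression, the model outputs one probability per class
probabilities = model.predict_proba(X_test[:1])
print(probabilities)
```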

K-Nearest Neighbor / KNN

See openHPI Jupyter Notebook: KNN_video.ipynb

Can be used for:

Classification
Regression

KNN is a supervised learning algorithm. It is called a lazy learning algorithm because it does not explicitly build a model during training; instead, it relies on the stored instances and their associated class labels to make predictions.
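The lazy behavior can be seen in a short scikit-learn sketch. Here fit() essentially stores the training data, and each prediction looks up the nearest stored instances. The value k = 5 and the split settings are assumptions of this example and are not taken from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# "Training" keeps the instances and their labels for later lookups
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
print(accuracy)

# Inspect which stored training instances decide the first test prediction
distances, indices = knn.kneighbors(X_test[:1])
neighbor_labels = y_train[indices[0]]
print(neighbor_labels, knn.predict(X_test[:1]))
```

The predicted class is the majority label among the five neighbors. For regression, KNeighborsRegressor averages the neighbors' values instead.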
