class: center, middle, inverse, title-slide

.title[
# Python for Statistics and Data Science (part 2)
]
.subtitle[
## Models, regressions and machine learning
]
.author[
### Guillaume Falmagne
]
.date[
### Nov. 6th, 2024
]
---
# Roles of a model

A model can have two main roles:

- **Understanding**: usually some physically-motivated, minimal equations that pinpoint a key phenomenon/mechanism
→ Generally, fit the data with a model that has few parameters, where the role of each parameter is "human-readable"

- **Description/prediction**: the goal here is to have a model that fits the data very well, and perhaps predicts future outcomes, even if it involves many parameters that are hard to interpret
→ This is where machine learning and dimensionality-reduction techniques are best

---
# Simplest linear fit/regression

.negspace10[
Multiple options (in `scipy` or `numpy`) exist for the simplest and most common fit/regression: a line
]
.negspace20[
.pull-left[

``` python
import scipy as sc
tips = sns.load_dataset("tips")
x = tips['total_bill']
y = tips['tip']

# numpy polynomial fit (1 is the degree)
*f = np.polyfit(x, y, 1)
fit_function = np.poly1d(f)
y_pred1 = fit_function(x)

# OR scipy.stats
slope, intercept, r_value, p_value, std_err = \
*    sc.stats.linregress(x, y)
y_pred2 = intercept + slope * x

# OR scipy.optimize
*param, cov = sc.optimize.curve_fit(
*    lambda x, a, b: a + b*x, x, y)
y_pred3 = param[0] + param[1] * x

plt.figure()
tips.plot.scatter(x='total_bill', y='tip',
                  c='sex', colormap='viridis')
plt.plot(x, y_pred1, 'r--')
plt.plot(x, y_pred2, 'b--')
plt.plot(x, y_pred3, 'g--') # overlaps other lines
```
]
.pull-right[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-2-1.png" width="672" />
]
]

---
# More complex curve fitting (1)

.pull-left[

``` python
netflix_df = pd.read_csv("data/netflix_titles.csv")
# "https://raw.githubusercontent.com/EEB330/slides/main/16_intro_to_python_2/data/netflix_titles.csv"
cut = (netflix_df['release_year'] >= 1950) & (netflix_df['release_year'] < 2022)

def exponential(x, a, b):
    return a * np.exp(b * (x-1950))

plt.figure()
t = np.linspace(1950, 2022, 72)
n,_,_ = plt.hist(netflix_df['release_year'][cut], bins=t)
*param, cov = sc.optimize.curve_fit(
*    exponential,
*    (t[:-1]+0.5)[:-3], n[:-3], p0=[1, 0.1])
plt.plot(t[:-3], exponential(t, *param)[:-3], 'r--')
#plt.yscale('log')
plt.xlabel("Year")
plt.ylabel("Number of movies")
```
]
.pull-right[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-4-3.png" width="100%" />
]

---
# More complex curve fitting (1)

It is probably better to do a linear fit of `\(\log(n)\)` than an exponential fit of `\(n\)`, so that the low values also constrain the fit! (The code below only switches the y-axis to a log scale, which makes the misfit at low values visible; a sketch of an actual fit in log space follows this slide.)

.pull-left[

``` python
netflix_df = pd.read_csv("data/netflix_titles.csv")
# "https://raw.githubusercontent.com/EEB330/slides/main/16_intro_to_python_2/data/netflix_titles.csv"
cut = (netflix_df['release_year'] >= 1950) & (netflix_df['release_year'] < 2022)

def exponential(x, a, b):
    return a * np.exp(b * (x-1950))

plt.figure()
t = np.linspace(1950, 2022, 72)
n,_,_ = plt.hist(netflix_df['release_year'][cut], bins=t)
*param, cov = sc.optimize.curve_fit(
*    exponential,
*    (t[:-1]+0.5)[:-3], n[:-3], p0=[1, 0.1])
plt.plot(t[:-3], exponential(t, *param)[:-3], 'r--')
plt.yscale('log')
plt.xlabel("Year")
plt.ylabel("Number of movies")
```
]
.pull-right[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-6-5.png" width="100%" />
]
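---
# More complex curve fitting (1)

A minimal sketch of the log-space alternative, reusing `t` and `n` from the previous slide (the variable names below are illustrative, not from the original analysis):

``` python
# Bin centers and counts, dropping empty bins so that log(n) is defined
centers = (t[:-1] + 0.5)[:-3]
counts = n[:-3]
nonzero = counts > 0

# Straight-line fit in log space: log(n) = log(a) + b*(x-1950)
# np.polyfit returns the coefficients highest degree first
b, log_a = np.polyfit(centers[nonzero] - 1950,
                      np.log(counts[nonzero]), 1)

plt.figure()
plt.plot(centers, counts, 'k.')
plt.plot(centers, np.exp(log_a + b*(centers - 1950)), 'r--')
plt.yscale('log')
plt.xlabel("Year")
plt.ylabel("Number of movies")
```

Because the fit is done on `\(\log(n)\)`, every decade of counts is weighted equally, instead of the largest bins dominating the least squares.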
---
# More complex curve fitting (2)

More complex situations can sometimes be fitted with a physically-motivated model

.pull-left[

``` python
x = np.linspace(0, 120, 120)
noise = 0.4*np.random.normal(size=120)
y = 0.1*x + np.sin(x / np.pi) + noise

plt.figure()
plt.plot(x, y)
```
]
.pull-right[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-8-7.png" width="100%" />
]

---
# More complex curve fitting (2)

More complex situations can sometimes be fitted with a physically-motivated model

.pull-left[

``` python
def my_func(x, a, b, c):
    return a*x + b*np.sin(x / (c*np.pi))

param, cov = sc.optimize.curve_fit(my_func, x, y)
param # [0.10068761 0.96933067 0.9978456]

plt.figure()
plt.plot(x, y)
plt.plot(x, my_func(x, *param))
```

The optimisation algorithm usually uses a **gradient descent** method to find the best parameters.

The quantity that is minimized is *usually* the sum of the squared differences between the data and the model (similar to a `\(\chi^2\)`). It can also be a negative log-**likelihood** (the basis of more Bayesian approaches).
]
.pull-right[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-10-9.png" width="100%" />
]

---
# Solving differential equations

Many models are actually not about functions, but about **relations between functions and their derivatives**. This is the realm of **differential equations**.

Let's solve numerically `\(\frac{dy}{dt} = -2y\)` (a nonlinear example follows this slide)

.pull-left[

``` python
from scipy.integrate import odeint

# Define the ODE function dy/dt = -2y
def dydt(y, t):
    return -2 * y

# Initial condition and time points
y0 = 1
t = np.linspace(0, 5, 100)

# Solve the ODE using odeint
y_numerical = odeint(dydt, y0, t)

# Analytical solution for comparison
y_analytical = y0 * np.exp(-2 * t)

plt.plot(t, y_numerical,
         label='Numerical Solution (odeint)', color='blue')
plt.plot(t, y_analytical, 'r--', label='Analytical Solution')
plt.xlabel('t')
plt.legend()
```
]
.pull-right[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-12-11.png" width="100%" />
]
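---
# Solving differential equations

The same recipe works for nonlinear models. A minimal sketch with the logistic growth equation `\(\frac{dy}{dt} = r\,y\,(1 - y/K)\)` (the parameter values are arbitrary, and `np`/`plt` are assumed imported as in the rest of the deck):

``` python
from scipy.integrate import odeint

# Logistic growth: dy/dt = r*y*(1 - y/K)
def logistic(y, t, r, K):
    return r * y * (1 - y / K)

r, K = 0.8, 100.0  # arbitrary growth rate and carrying capacity
y0 = 2.0           # initial population
t = np.linspace(0, 15, 200)

# Extra model parameters are passed through the `args` tuple
y_num = odeint(logistic, y0, t, args=(r, K))

plt.plot(t, y_num, label='Numerical solution (odeint)')
plt.axhline(K, color='gray', ls=':', label='Carrying capacity K')
plt.xlabel('t')
plt.legend()
```

For systems of several coupled equations (e.g. predator-prey), the same `odeint` call works with a vector-valued `y`.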
---
# Machine Learning Approaches Overview

Machine learning can be used to get an extensive description of high-dimensional data

- **Multilinear Regression** (not strictly ML): can model confounding variables, which helps distinguish correlation from causation
- **Dimensionality Reduction**:
    - **PCA** (Principal Component Analysis): reduces dataset dimensions while retaining most of the variance
    - **t-SNE** (t-Distributed Stochastic Neighbor Embedding): projects high-dimensional data to 2D/3D
- **Classification** and Clustering:
    - **K-Nearest Neighbors** (KNN): classifies based on the nearest labeled examples. Also see **K-Means Clustering**
    - **Decision Trees** & Random Forests: tree-based methods for classification and regression
    - Ensemble Methods (e.g., Boosting, Bagging): combine multiple models (e.g. trees)
- **Neural Networks** and Deep Learning
    - **Convolutional** Neural Networks (CNNs): specialized for image processing and spatial data
    - **Recurrent** Neural Networks (RNNs): useful for sequential data, like time series
    - Transformers & **Large Language Models** (LLMs): powerful architectures for NLP tasks, e.g., ChatGPT, BERT

---
# Multilinear Regression example

.left-column60[

``` python
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X = data.data    # Features (8 variables)
y = data.target  # Target variable (housing prices)

# Add an intercept column to X in the linear model
X_with_intercept = np.column_stack((np.ones(X.shape[0]), X))

# linear regression using numpy's least-squares solver
# beta contains the intercept + the 8 coefficients of the linear model
beta, _, _, _ = np.linalg.lstsq(X_with_intercept, y, rcond=None)

# Predictions
y_pred = X_with_intercept @ beta  # Matrix multiplication

plt.scatter(y, y_pred, alpha=0.5, color="blue")
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Multiple Linear Regression: Predicted vs. Actual Prices")
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--') # Line for perfect prediction
plt.show()
```
]
.right-column60[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-14-13.png" width="100%" />
]

---
# A more complex model: gradient-boosted decision trees

.negspace60[

``` python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data    # Features
y = data.target  # Target variable (species)

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the XGBoost classifier
model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')

# Train the model on the training data
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```
]

Then comes the problem of interpretability... Use **SHAP** (SHapley Additive exPlanations) values, as sketched on the next slide!
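---
# Interpretability with SHAP values

A minimal sketch with the `shap` package, assuming the `model`, `X_test` and `data` from the previous slide (the exact shape of `shap_values` for a multi-class model can vary between `shap` versions):

``` python
import shap

# TreeExplainer is optimized for tree ensembles such as XGBoost
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot: which features drive the predictions, and in which direction
shap.summary_plot(shap_values, X_test, feature_names=data.feature_names)
```

SHAP values decompose each individual prediction into additive contributions of each feature, which makes even a boosted-tree ensemble "human-readable" again.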