class: center, middle, inverse, title-slide

.title[
# Python for Statistics and Data Science (part 2)
]
.subtitle[
## Models, regressions and machine learning
]
.author[
### Guillaume Falmagne
]
.date[
### Nov. 6th, 2024
]
---
# Roles of a model

A model can have two main roles:

- **Understanding**: usually some physically-motivated, minimal equations that pinpoint a key phenomenon/mechanism
→ Generally, fit the data with a model that has few parameters, where the role of each parameter is "human-readable"

- **Description/prediction**: the goal here is to have a model that fits the data very well, and perhaps predicts future outcomes, even if it involves many parameters that are hard to interpret
→ This is where machine learning and dimensionality-reduction techniques are best

---
# Simplest linear fit/regression

.negspace10[
Multiple options (in `scipy` or `numpy`) exist for the simplest and most common fit/regression: a line
]
.negspace20[
.pull-left[

``` python
import scipy as sc
tips = sns.load_dataset("tips")
x = tips['total_bill']
y = tips['tip']

# numpy polynomial fit (1 is the degree)
*f = np.polyfit(x, y, 1)
fit_function = np.poly1d(f)
y_pred1 = fit_function(x)

# OR scipy.stats
slope, intercept, r_value, p_value, std_err = \
*    sc.stats.linregress(x, y)
y_pred2 = intercept + slope * x

# OR scipy.optimize
*param, cov = sc.optimize.curve_fit(
*    lambda x, a, b: a + b*x, x, y)
y_pred3 = param[0] + param[1] * x

plt.figure()
tips.plot.scatter(x='total_bill', y='tip',
                  c='sex', colormap='viridis')
plt.plot(x, y_pred1, 'r--')
plt.plot(x, y_pred2, 'b--')
plt.plot(x, y_pred3, 'g--') # overlaps other lines
```
]
.pull-right[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-2-1.png" width="672" />
]
]

---
# More complex curve fitting (1)

.pull-left[

``` python
netflix_df = pd.read_csv("data/netflix_titles.csv")
# "https://raw.githubusercontent.com/EEB330/slides/main/16_intro_to_python_2/data/netflix_titles.csv"
cut = (netflix_df['release_year'] >= 1950) & (netflix_df['release_year'] < 2022)

def exponential(x, a, b):
    return a * np.exp(b * (x-1950))

plt.figure()
t = np.linspace(1950, 2022, 72)
n,_,_ = plt.hist(netflix_df['release_year'][cut], bins=t)
*param, cov = sc.optimize.curve_fit(
*    exponential,
*    (t[:-1]+0.5)[:-3], n[:-3], p0=[1, 0.1])
plt.plot(t[:-3], exponential(t, *param)[:-3], 'r--')
#plt.yscale('log')
plt.xlabel("Year")
plt.ylabel("Number of movies")
```
]
.pull-right[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-4-3.png" width="100%" />
]

---
# More complex curve fitting (1)

It is probably better to do a linear fit of `\(\log(n)\)` than an exponential fit of `\(n\)`, so that the low values also constrain the fit! (The code below only switches the y-axis to a log scale, which makes the misfit at low values visible; a sketch of an actual fit in log space follows this slide.)

.pull-left[

``` python
netflix_df = pd.read_csv("data/netflix_titles.csv")
# "https://raw.githubusercontent.com/EEB330/slides/main/16_intro_to_python_2/data/netflix_titles.csv"
cut = (netflix_df['release_year'] >= 1950) & (netflix_df['release_year'] < 2022)

def exponential(x, a, b):
    return a * np.exp(b * (x-1950))

plt.figure()
t = np.linspace(1950, 2022, 72)
n,_,_ = plt.hist(netflix_df['release_year'][cut], bins=t)
*param, cov = sc.optimize.curve_fit(
*    exponential,
*    (t[:-1]+0.5)[:-3], n[:-3], p0=[1, 0.1])
plt.plot(t[:-3], exponential(t, *param)[:-3], 'r--')
plt.yscale('log')
plt.xlabel("Year")
plt.ylabel("Number of movies")
```
]
.pull-right[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-6-5.png" width="100%" />
]
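---
# More complex curve fitting (1)

A minimal sketch of the log-space alternative, reusing `t` and `n` from the previous slide (the variable names below are illustrative, not from the original analysis):

``` python
# Bin centers and counts, dropping empty bins so that log(n) is defined
centers = (t[:-1] + 0.5)[:-3]
counts = n[:-3]
nonzero = counts > 0

# Straight-line fit in log space: log(n) = log(a) + b*(x-1950)
# np.polyfit returns the coefficients highest degree first
b, log_a = np.polyfit(centers[nonzero] - 1950,
                      np.log(counts[nonzero]), 1)

plt.figure()
plt.plot(centers, counts, 'k.')
plt.plot(centers, np.exp(log_a + b*(centers - 1950)), 'r--')
plt.yscale('log')
plt.xlabel("Year")
plt.ylabel("Number of movies")
```

Because the fit is done on `\(\log(n)\)`, every decade of counts is weighted equally, instead of the largest bins dominating the least squares.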
---
# More complex curve fitting (2)

More complex situations can sometimes be fitted with a physically-motivated model

.pull-left[

``` python
x = np.linspace(0, 120, 120)
noise = 0.4*np.random.normal(size=120)
y = 0.1*x + np.sin(x / np.pi) + noise

plt.figure()
plt.plot(x, y)
```
]
.pull-right[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-8-7.png" width="100%" />
]

---
# More complex curve fitting (2)

More complex situations can sometimes be fitted with a physically-motivated model

.pull-left[

``` python
def my_func(x, a, b, c):
    return a*x + b*np.sin(x / (c*np.pi))

param, cov = sc.optimize.curve_fit(my_func, x, y)
param # [0.10068761 0.96933067 0.9978456]

plt.figure()
plt.plot(x, y)
plt.plot(x, my_func(x, *param))
```

The optimisation algorithm usually uses a **gradient descent** method to find the best parameters.

The quantity that is minimized is *usually* the sum of the squared differences between the data and the model (similar to a `\(\chi^2\)`). It can also be a negative log-**likelihood** (the basis of more Bayesian approaches).
]
.pull-right[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-10-9.png" width="100%" />
]

---
# Solving differential equations

Many models are actually not about functions, but about **relations between functions and their derivatives**. This is the realm of **differential equations**.

Let's solve numerically `\(\frac{dy}{dt} = -2y\)` (a nonlinear example follows this slide)

.pull-left[

``` python
from scipy.integrate import odeint

# Define the ODE function dy/dt = -2y
def dydt(y, t):
    return -2 * y

# Initial condition and time points
y0 = 1
t = np.linspace(0, 5, 100)

# Solve the ODE using odeint
y_numerical = odeint(dydt, y0, t)

# Analytical solution for comparison
y_analytical = y0 * np.exp(-2 * t)

plt.plot(t, y_numerical,
         label='Numerical Solution (odeint)', color='blue')
plt.plot(t, y_analytical, 'r--', label='Analytical Solution')
plt.xlabel('t')
plt.legend()
```
]
.pull-right[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-12-11.png" width="100%" />
]
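---
# Solving differential equations

The same recipe works for nonlinear models. A minimal sketch with the logistic growth equation `\(\frac{dy}{dt} = r\,y\,(1 - y/K)\)` (the parameter values are arbitrary, and `np`/`plt` are assumed imported as in the rest of the deck):

``` python
from scipy.integrate import odeint

# Logistic growth: dy/dt = r*y*(1 - y/K)
def logistic(y, t, r, K):
    return r * y * (1 - y / K)

r, K = 0.8, 100.0  # arbitrary growth rate and carrying capacity
y0 = 2.0           # initial population
t = np.linspace(0, 15, 200)

# Extra model parameters are passed through the `args` tuple
y_num = odeint(logistic, y0, t, args=(r, K))

plt.plot(t, y_num, label='Numerical solution (odeint)')
plt.axhline(K, color='gray', ls=':', label='Carrying capacity K')
plt.xlabel('t')
plt.legend()
```

For systems of several coupled equations (e.g. predator-prey), the same `odeint` call works with a vector-valued `y`.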
---
# Machine Learning Approaches Overview

Machine learning can be used to get an extensive description of high-dimensional data

- **Multilinear Regression** (not strictly ML): can model confounding variables, which helps distinguish correlation from causation
- **Dimensionality Reduction**:
    - **PCA** (Principal Component Analysis): reduces dataset dimensions while retaining most of the variance
    - **t-SNE** (t-Distributed Stochastic Neighbor Embedding): projects high-dimensional data to 2D/3D
- **Classification** and Clustering:
    - **K-Nearest Neighbors** (KNN): classifies based on the nearest labeled examples. Also see **K-Means Clustering**
    - **Decision Trees** & Random Forests: tree-based methods for classification and regression
    - Ensemble Methods (e.g., Boosting, Bagging): combine multiple models (e.g. trees)
- **Neural Networks** and Deep Learning
    - **Convolutional** Neural Networks (CNNs): specialized for image processing and spatial data
    - **Recurrent** Neural Networks (RNNs): useful for sequential data, like time series
    - Transformers & **Large Language Models** (LLMs): powerful architectures for NLP tasks, e.g., ChatGPT, BERT

---
# Multilinear Regression example

.left-column60[

``` python
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()
X = data.data    # Features (8 variables)
y = data.target  # Target variable (housing prices)

# Add an intercept column to X in the linear model
X_with_intercept = np.column_stack((np.ones(X.shape[0]), X))

# linear regression using numpy's least-squares solver
# beta contains the intercept + the 8 coefficients of the linear model
beta, _, _, _ = np.linalg.lstsq(X_with_intercept, y, rcond=None)

# Predictions
y_pred = X_with_intercept @ beta  # Matrix multiplication

plt.scatter(y, y_pred, alpha=0.5, color="blue")
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Multiple Linear Regression: Predicted vs. Actual Prices")
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--') # Line for perfect prediction
plt.show()
```
]
.right-column60[
<img src="16_python_stats_2_files/figure-html/unnamed-chunk-14-13.png" width="100%" />
]

---
# A more complex model: gradient-boosted decision trees

.negspace60[

``` python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data    # Features
y = data.target  # Target variable (species)

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the XGBoost classifier
model = XGBClassifier(use_label_encoder=False, eval_metric='mlogloss')

# Train the model on the training data
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate and print the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```
]

Then comes the problem of interpretability... Use **SHAP** (SHapley Additive exPlanations) values, as sketched on the next slide!
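---
# Interpretability with SHAP values

A minimal sketch with the `shap` package, assuming the `model`, `X_test` and `data` from the previous slide (the exact shape of `shap_values` for a multi-class model can vary between `shap` versions):

``` python
import shap

# TreeExplainer is optimized for tree ensembles such as XGBoost
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Summary plot: which features drive the predictions, and in which direction
shap.summary_plot(shap_values, X_test, feature_names=data.feature_names)
```

SHAP values decompose each individual prediction into additive contributions of each feature, which makes even a boosted-tree ensemble "human-readable" again.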