Precept intro

Precepts can be found here: https://eeb330.github.io/#Precepts.

We’ll mostly be working through the precept problem sets during class, but I’ll be posting them as well. If you don’t finish during class, you can finish them on your own time and submit them to the appropriate assignment on GitHub Classroom within one week.

Introduction

This precept focuses on getting started with programming in R and Python and on using git for version control, including how we will submit assignments and projects throughout the semester.

Getting started

We support two IDEs in the course: RStudio and VS Code. I’ll be using VS Code for the precepts, but feel free to use whichever you prefer. VS Code is a bit more lightweight and extensible, but RStudio has a lot of nice features specifically for R.

Before going forward, let’s install the basic software we need: R and git.

I recommend Homebrew for Mac users, but you can also use the installers from the links provided.

VS Code

VS Code. Once you’ve installed and launched VS Code, you’ll need to grab a few extensions. I’d recommend getting the R Extension and the Python Extension at minimum.

RStudio

RStudio

R and Python

Make sure that you’re able to run basic R/Python code like the following:

print("Hello world!")
## [1] "Hello world!"

You will need to create two separate files to test each language. One will be an R file, the other a Python file. Note that for this simple print statement, the syntax for R and Python is the same. As the course progresses, we will begin to learn important differences in the syntax of these two languages!

Git

For the semester we’re going to be using GitHub Classroom to manage assignments and projects. You should have received an announcement with the Classroom link. If you haven’t, please let me know.

If you have not already installed git, make sure to install it. The installation guide can be found here: https://git-scm.com/book/en/v2/Getting-Started-Installing-Git.

GitHub

First, you’ll need to create an account on GitHub. You can do that at https://github.com/join.

Authentication

Install GitHub CLI – found here: https://cli.github.com/ – and authenticate with GitHub. You can do this by running the following command in your terminal:

gh auth login
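
Before moving on, it can help to confirm that the command-line tools mentioned so far are actually on your PATH. The following is a minimal sketch, not part of the assignment: the helper name `find_tools` is made up for illustration, and the executable names (`git`, `gh`, `Rscript`, `python3`) are assumptions that may differ on your system (for example, Python is often just `python` on Windows).

```python
import shutil

# Assumption: these are the usual executable names; adjust if yours differ.
TOOLS = ["git", "gh", "Rscript", "python3"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Print one line per tool so missing installs are easy to spot.
for name, path in find_tools(TOOLS).items():
    print(f"{name:8} {path or 'NOT FOUND'}")
```

Anything reported as NOT FOUND either is not installed yet or was installed somewhere your terminal does not look; restarting the terminal after an install often fixes the latter.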

GitHub Classroom

Accepting the assignment

When you accept the assignment, it will create a private repository (only visible to EEB330 staff and yourself) that you can use. Once you have that repository, you can clone it to your local machine and start working on it.

Cloning the repository

To clone the repository, you’ll need to copy the URL from the repository page. You can do this by clicking the green “Code” button and copying the URL. Then, in your terminal, you can run git clone <URL> to clone the repository to your local machine.

git clone https://github.com/EEB330/intro-to-git-{GITHUB_USERNAME}.git # The URL is on the repository page under the green "Code" button once GitHub has created the repository
cd intro-to-git-{GITHUB_USERNAME}
git checkout -b precept-1 # or what you want to call your branch

Working on the assignment in R

For this assignment, you can just create a simple example R file containing some code and text. You can then commit and push your changes to the branch you created.

Running R code: To run R code (in RStudio or VS Code), press cmd+enter with the line or block of code you want to run selected (if nothing is selected, only the line your cursor is on will run). You can also run the entire file with cmd+shift+s or cmd+shift+enter. On Windows, use ctrl instead of cmd. In VS Code, you can also click the play button in the top-right corner of the editor to run the entire file!

Our goal is to make a simple R file for exploring the iris data set.

# ----- Loading and Exploring the iris dataset -----

# Load the dataset -- note that this is a built-in dataset in R
data(iris)

# View the first few rows to understand its structure
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Check the detailed structure of the dataset for more information on its columns
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Generate summary statistics to get a sense of the data distribution
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
# ----- Data Visualization -----

# Scatter plot visualizing the relationship between Sepal measurements
plot(iris$Sepal.Length, iris$Sepal.Width, main="Sepal Length vs Sepal Width", 
     xlab="Sepal Length", ylab="Sepal Width", col=iris$Species, pch=16, cex=1.3)
legend("topright", legend=levels(iris$Species), col=1:3, pch=16)

# Scatter plot visualizing the relationship between Petal measurements
plot(iris$Petal.Length, iris$Petal.Width, main="Petal Length vs Petal Width", 
     xlab="Petal Length", ylab="Petal Width", col=iris$Species, pch=16, cex=1.3)
legend("topright", legend=levels(iris$Species), col=1:3, pch=16)

# ----- Modifying the Dataset -----

# Add a new column 'Petal.Length.Class' that classifies flowers based on petal length
iris$Petal.Length.Class <- ifelse(iris$Petal.Length < 2, "Short", 
                           ifelse(iris$Petal.Length < 5, "Medium", "Long"))

# View the initial rows of the modified dataset to see the added column
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Length.Class
## 1          5.1         3.5          1.4         0.2  setosa              Short
## 2          4.9         3.0          1.4         0.2  setosa              Short
## 3          4.7         3.2          1.3         0.2  setosa              Short
## 4          4.6         3.1          1.5         0.2  setosa              Short
## 5          5.0         3.6          1.4         0.2  setosa              Short
## 6          5.4         3.9          1.7         0.4  setosa              Short

Working on the assignment in Python

Now we will work with the same iris dataset in Python. Unlike in R, iris is not a built-in dataset that is readily available for use. Thus, you will need to download the dataset from the assignment repository and add it to your current working directory for this example.

Running Python code: To run Python code, press shift+enter with the line or block of code you want to run selected. As with R, you can run an entire .py file by clicking the play button in VS Code.

# ----- Loading and Exploring the iris dataset -----

# Import the necessary libraries: pandas is used for data organization and manipulation, while matplotlib has functions useful for data visualization
import pandas as pd
import matplotlib.pyplot as plt

# Read in the data and create a DataFrame variable called "iris"
iris = pd.read_csv("iris.csv")

# View the first few rows and the structure of the dataset; note the similarities and differences with the dataset in R
print(iris.head())
##    Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
## 0   1            5.1           3.5            1.4           0.2  Iris-setosa
## 1   2            4.9           3.0            1.4           0.2  Iris-setosa
## 2   3            4.7           3.2            1.3           0.2  Iris-setosa
## 3   4            4.6           3.1            1.5           0.2  Iris-setosa
## 4   5            5.0           3.6            1.4           0.2  Iris-setosa
# info() prints its report directly and returns None, so it does not need print()
iris.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 150 entries, 0 to 149
## Data columns (total 6 columns):
##  #   Column         Non-Null Count  Dtype  
## ---  ------         --------------  -----  
##  0   Id             150 non-null    int64  
##  1   SepalLengthCm  150 non-null    float64
##  2   SepalWidthCm   150 non-null    float64
##  3   PetalLengthCm  150 non-null    float64
##  4   PetalWidthCm   150 non-null    float64
##  5   Species        150 non-null    object 
## dtypes: float64(4), int64(1), object(1)
## memory usage: 7.2+ KB
# We can also generate summary statistics to get a sense of the data distribution, this time using the describe() method
print(iris.describe())
##                Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
## count  150.000000     150.000000    150.000000     150.000000    150.000000
## mean    75.500000       5.843333      3.054000       3.758667      1.198667
## std     43.445368       0.828066      0.433594       1.764420      0.763161
## min      1.000000       4.300000      2.000000       1.000000      0.100000
## 25%     38.250000       5.100000      2.800000       1.600000      0.300000
## 50%     75.500000       5.800000      3.000000       4.350000      1.300000
## 75%    112.750000       6.400000      3.300000       5.100000      1.800000
## max    150.000000       7.900000      4.400000       6.900000      2.500000
# ----- Data Visualization -----

# Make a scatter plot of Sepal width against Sepal length
plt.title("Sepal Width vs Sepal Length")
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Sepal Width (cm)")

groups = iris.groupby("Species")
for name, group in groups:
    plt.scatter(group["SepalLengthCm"], group["SepalWidthCm"], label = name, marker = "o", s = 30)

plt.legend(loc = "upper right")
plt.show()

# Now try making a scatter plot of Petal width against Petal length!

# ----- Modifying the Dataset -----

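# Add a new column 'PetalLengthClass' that classifies flowers based on petal length
# right = False gives the bins [-inf, 2), [2, 5), [5, inf), matching the `<` comparisons in the R version
# (using the column minimum as the first bin edge would leave the shortest flower unclassified)
iris["PetalLengthClass"] = pd.cut(iris["PetalLengthCm"], bins = [float("-inf"), 2, 5, float("inf")], labels = ["Short", "Medium", "Long"], right = False)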

# View the initial rows of the modified dataset
print(iris.head())
##    Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species PetalLengthClass
## 0   1            5.1           3.5            1.4           0.2  Iris-setosa            Short
## 1   2            4.9           3.0            1.4           0.2  Iris-setosa            Short
## 2   3            4.7           3.2            1.3           0.2  Iris-setosa            Short
## 3   4            4.6           3.1            1.5           0.2  Iris-setosa            Short
## 4   5            5.0           3.6            1.4           0.2  Iris-setosa            Short
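
To see how values on the cut points are handled when binning, here is a small sketch on made-up petal lengths (not the real iris data). It compares `pd.cut` against a plain function that mirrors the nested `ifelse()` from the R example; the function name `classify` and the sample values are illustrative assumptions.

```python
import pandas as pd

# Made-up petal lengths chosen to sit on and around the cut points 2 and 5
lengths = pd.Series([1.0, 1.9, 2.0, 4.9, 5.0, 6.9])


def classify(length):
    """Plain-Python version of the nested ifelse() used in the R example."""
    if length < 2:
        return "Short"
    if length < 5:
        return "Medium"
    return "Long"


# right=False makes each bin include its left edge and exclude its right edge,
# so 2.0 is "Medium" and 5.0 is "Long", matching the `<` comparisons in R.
binned = pd.cut(
    lengths,
    bins=[float("-inf"), 2, 5, float("inf")],
    labels=["Short", "Medium", "Long"],
    right=False,
)

# Put both classifications side by side for comparison
comparison = pd.DataFrame({
    "length": lengths,
    "pd_cut": binned.astype(str),
    "ifelse_style": lengths.apply(classify),
})
print(comparison)
```

With the default `right=True`, each bin would instead include its right edge, so a length of exactly 2.0 would be labeled "Short" rather than "Medium".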

Pushing your changes

git add precept_1.Rmd # or the file you created
git commit -m "Adding precept 1" # or the message you want
git push -u origin precept-1 # or the branch you created

Note that after you set the upstream branch, you can just run git push to push your changes to the remote repository.

Submitting the assignment

Once you make any changes to the main branch, the assignment will be marked as submitted. To avoid your work being marked as submitted before it is finished, make sure you use branches and pull requests.

Once you’ve made changes to your development branch, you can make a PR that details all of the changes you’ve made across multiple commits. This allows you to merge back in a single unit of work.
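
Because changes on the main branch count as a submission, a quick check of which branch you are on before committing can save some trouble. The sketch below is one possible way to do that from Python, not a course requirement: the function names are hypothetical, and it assumes the default branch is called `main` or `master`. It relies on `git branch --show-current`, which needs a reasonably recent version of git.

```python
import subprocess

# Assumption: the default branch uses one of these names.
PROTECTED_BRANCHES = {"main", "master"}


def current_branch():
    """Return the checked-out branch name, or None if it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "branch", "--show-current"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:  # git is not installed or not on PATH
        return None
    name = result.stdout.strip()
    return name if result.returncode == 0 and name else None


def is_safe_to_commit(branch):
    """True only for a known branch that is not a protected one."""
    return branch is not None and branch not in PROTECTED_BRANCHES


if __name__ == "__main__":
    branch = current_branch()
    if is_safe_to_commit(branch):
        print(f"On '{branch}': fine to commit here.")
    else:
        print(f"On {branch!r}: switch to a development branch first.")
```

Run it from inside your cloned repository; outside a git repository the branch cannot be determined and the script reports that you should switch branches.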

Creating a PR

To create a Pull Request on GitHub, navigate to your branch (under Branches in the web interface) and select Contribute > Open pull request.

Example Pull Request

# Summary

This PR shows example formatting. Because PRs are the primary place where code is evaluated, make sure that your PRs are clear and descriptive. PRs can include markdown, so they can become relatively complex if the assignment is complicated. This should be paired with well-documented code to allow others to easily follow your design and implementation.

# Design notes

N/A

# Implementation notes

N/A

WARNING

Be careful here! Please try to only merge PRs when you have completed assignments. If you merge multiple PRs for an assignment, I’ll try to loop back to them if I’ve already graded one, but if I miss them for some reason, please let me know!

Submission

For this precept, you’ll just need to make a merged PR with your changes. I’ll be checking the PRs for the assignment to make sure you wrote something in R and Python and that you’ve made a well-structured PR and merged it. If you have any questions, please let me know!