File paths, organization and notebooks #35

Merged
merged 9 commits into from
Dec 20, 2024
181 changes: 172 additions & 9 deletions 05-setting-up.Rmd
@@ -9,35 +9,169 @@ ottrpal::set_knitr_image_path()
ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g21a84b32106_0_13")
```

## Understand why project organization is key to reproducible analyses

Keeping your files organized is a skill that has a high long-term payoff. As you are in the thick of an analysis, you may underestimate how many files and terms you have floating around. But a short time later, you may return to your files and realize your organization was not as clear as you hoped.

@Tayo2019 discusses four particular reasons why it is important to organize your project:

> 1. Organization **increases productivity**. If a project is well organized, with everything placed in one directory, it makes it easier to avoid wasting time searching for project files such as datasets, codes, output files, and so on.
> 2. A well-organized project helps you to keep and **maintain a record** of your ongoing and completed data science projects.
> 3. Completed data science projects could be used for **building future models**. If you have to solve a similar problem in the future, you can use the same code with slight modifications.
> 4. A well-organized project can **easily be understood** by other data science professionals when shared on platforms such as Github.

Organization is yet another aspect of reproducibility that saves you and your colleagues time!

```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! example image"}
ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_1121")
```

## General principles of project organization

Project organization should work for you and not the other way around. The goal should be organization that is maintainable long term. As you might imagine, the optimal organizational scheme might differ from one individual to another or even one project to another.

```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! example image"}
ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_421")
ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_426")
```

There are many ways to keep your files organized, and there's no "one size fits all" organizational solution [@Shapiro2021]. In this chapter, we will discuss some generalities; for specifics, we will point you to others who have written about what works for them. We suggest that you use them as inspiration to figure out a strategy that works for you and your team.

The most important aspects of your project organization scheme are that it:

- Is [project-oriented](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/) [@Bryan2017].
- Follows consistent patterns [@Shapiro2021].
- Makes it easy for you and others to find the files you need quickly [@Shapiro2021].
- Minimizes the likelihood of errors (like writing over files accidentally) [@Shapiro2021].
- Is maintainable [@Shapiro2021]!

Contributor: idk why but these resources all have question marks in the render preview

Contributor Author: AH shoot I must be missing the references in the bib file. Will look into this.

```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! example image"}
ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_426")
ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_421")
```

### READMEs!

READMEs are also a great way to help your collaborators get quickly acquainted with the project.

```{r, fig.align='center', echo = FALSE, fig.alt= "Avi is looking at a set of project files that include a file called a ‘README.md’. Avi says 'I had no idea where to start with this analysis that Ruby sent me to review, but then I saw she included a README and that saved me so much time and effort in getting started!'"}
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf8379bb805_0_11")
```

READMEs stick out in a project and are a generally universal signal for people new to the project to start by READing them. GitHub will automatically preview your file called "README.md" when someone comes to the main page of your repository. This further encourages people looking at your project to read the information in your README.

**Information that should be included in a README:**

Contributor: Maybe include who is involved in the project or who to contact? (I imagine some people will not be on GitHub)

Contributor: Maybe we should talk about licenses?

Contributor: Could do it briefly

Contributor: added something below!


1) General purpose of the project
2) Instructions on how to re-run the project
3) Lists of any software required by the project
4) Input and output file descriptions
5) Descriptions of any additional tools included in the project

You can take a look at this [template README](https://raw.githubusercontent.com/jhudsl/Reproducibility_in_Cancer_Informatics/main/resources/README-template.md) to get you started.

#### More about writing READMEs:

- [How to write a good README file by Hillary Nyakundi](https://www.freecodecamp.org/news/how-to-write-a-good-readme-file/)
- [How to write an awesome README by Navendu Pottekkat](https://towardsdatascience.com/how-to-write-an-awesome-readme-68bf4be91f8b)

#### Examples of good READMEs:

- https://github.com/stephaniehicks/qsmooth
- https://github.com/lcolladotor/derfinder
- https://github.com/tidyverse/dplyr

### Example organization scheme

Getting more specific, here are some ideas for how to organize your project:

- **Make file names informative** to those who don't have knowledge of the project -- but avoid using spaces, quotes, or unusual characters in your filenames and folders, as these can make reading in files a nightmare with some programs.
- **Number scripts** in the order that they are run.
- **Keep like-files together** in their own directory: results tables with other results tables, etc. _Most importantly, keep raw data separate from processed data or other results!_

Contributor: maybe could add sub listing for raw data separate from processed data so we can say a bit more about why? (that it helps us keep from overwriting our raw data)

- **Put source scripts and functions in their own directory**. These are things that should never need to be called directly by you or anyone else.
- **Put output in its own directories** like `results` and `plots`.
- **Have a central document (like a README)** that describes the basic information about the analysis and how to re-run it.
- Make it easy on yourself: **dates aren't necessary**. The computer keeps track of those.

Contributor: Maybe need a bit more info about this... that dates of file updates aren't necessary? if a survey is from 2020 that might be necessary

Contributor: see below

- **Make a central script that re-runs everything** -- including the creation of the folders! (more on this in a later chapter)
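
To make the last item in the list above more concrete, here is a minimal sketch of what a central script could look like. It is an illustration under assumptions, not the script from any particular project: the folder names follow the example layout in this chapter, and the convention that every step starts with a two-digit number (like `00-` or `01-`) is assumed. Rendering the `.Rmd` step assumes that R and the `rmarkdown` package are installed.

```shell
#!/bin/sh
# run_analysis.sh: a sketch of a central script (hypothetical contents).
set -eu

# Recreate the data and output folders so the analysis can run from a fresh copy.
# -p means "do not complain if the folder already exists".
mkdir -p raw-data processed-data results plots

# Run each numbered step in order; the shell expands this pattern in sorted order.
for step in [0-9][0-9]-*; do
  # If no numbered scripts exist, the pattern is left unexpanded, so skip it.
  [ -e "$step" ] || continue
  echo "Running $step"
  case "$step" in
    *.sh)  sh "$step" ;;
    *.Rmd) Rscript -e 'rmarkdown::render(commandArgs(trailingOnly = TRUE)[1])' "$step" ;;
    *)     echo "Skipping $step (unrecognized file type)" ;;
  esac
done
```

Because the scripts are numbered, the central script does not need to list them by name; adding a `02-` step later would be picked up automatically.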

Let's see what these principles might look like in practice.

```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! example image"}
ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_442")
```

Here's an example of what this might look like:
```
project-name/
├── run_analysis.sh
├── 00-download-data.sh
├── 01-make-heatmap.Rmd
├── README.md
├── plots/
│   └── project-name-heatmap.png
├── results/
│   └── top_gene_results.tsv
├── raw-data/
│   ├── project-name-raw.tsv
│   └── project-name-metadata.tsv
├── processed-data/
│   └── project-name-quantile-normalized.tsv
└── util/
    ├── plotting-functions.R
    └── data-wrangling-functions.R
```

**What these hypothetical files and folders contain:**

- `run_analysis.sh` - A central script that runs everything
- `00-download-data.sh` - The script that needs to be run first and is called by `run_analysis.sh`
- `01-make-heatmap.Rmd` - The script that needs to be run second and is also called by `run_analysis.sh`
- `README.md` - The document that has the information that will orient someone to this project
- `plots` - A folder of plots and resulting images
- `results` - A folder of results
- `raw-data` - Data files exactly as they first arrive; **nothing** has been done to them yet
- `processed-data` - Data that has been modified from the raw in some way
- `util` - A folder of utilities that never needs to be called or touched directly unless troubleshooting something
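
If you would like to try this layout out, the following sketch creates an empty skeleton of it. The project name and file names are the hypothetical ones from the example above, and the skeleton is built inside a throwaway folder (made with `mktemp -d`) so that nothing real gets overwritten.

```shell
#!/bin/sh
set -eu

# Work in a throwaway folder so nothing real is overwritten.
scratch="$(mktemp -d)"
cd "$scratch"

# Hypothetical project name; replace with your own.
project="project-name"

# Folders from the example layout (-p makes this safe to re-run).
for dir in plots results raw-data processed-data util; do
  mkdir -p "$project/$dir"
done

# Empty placeholders for the top-level scripts and the README.
for file in run_analysis.sh 00-download-data.sh 01-make-heatmap.Rmd README.md; do
  touch "$project/$file"
done

# Empty placeholders for the helper functions.
touch "$project/util/plotting-functions.R" "$project/util/data-wrangling-functions.R"

# List everything that was created.
find "$project" | sort
```

Starting from an empty skeleton like this can make it easier to follow consistent patterns from the first day of a project.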

There are lots of ideas out there for organizational strategies. The key is finding one that fits your team and your project. You can read through some of these articles to think about what kind of organizational strategy might work for you and your team:

- [Reproducible R example](https://github.com/jhudsl/reproducible-r-example)

Contributor: could site yourself?

- [Jenny Bryan's organizational strategies](https://www.stat.ubc.ca/~jenny/STAT545A/block19_codeFormattingOrganization.html) [@Bryan2021].
- [Danielle Navarro's organizational strategies](https://www.youtube.com/playlist?list=PLRPB0ZzEYegPiBteC2dRn95TX9YefYFyy) [@Navarro2021].
- [Jenny Bryan on Project-oriented workflows](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/) [@Bryan2017].
- [Data Carpentry mini-course about organizing projects](https://datacarpentry.org/organization-genomics/) [@DataCarpentry2021].
- [Andrew Severin's strategy for organization](https://bioinformaticsworkbook.org/projectManagement/Intro_projectManagement.html#gsc.tab=0) [@Severin2021].
- [A BioStars thread where many individuals share their own organizational strategies](https://www.biostars.org/p/821/) [@Biostars2021].
- [Data Carpentry course chapter about getting organized](https://bioinformatics-core-shared-training.github.io/shell-genomics/07-organization/index.html) [@DataCarpentry2019].

## Navigate file paths

In point-and-click apps (called [Graphical User Interfaces](https://en.wikipedia.org/wiki/Graphical_user_interface), or GUIs, pronounced like the word "gooey"), you navigate to files by clicking on folders. But for R programming and other command line interfaces, we navigate to and use files with `file paths`. A `file path` is the series of folders that it takes to get to a file, not unlike a street address.

To make an analogy, if someone asked you for directions to a particular building, the directions you would give would depend on where the person asking is located. In other words, your directions would be relative to their location.

File paths can be *relative* or *absolute*.

You can give your computer directions relative to the folder where you are running the command, or you can give it absolute directions to a file: the full directions to that file, regardless of where you currently are on your computer.

```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! example image"}
ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_1337")
```
So in our above analogy, if you are trying to direct someone to somewhere on Johns Hopkins’ campus with a file path:

An absolute file path would be:
`/Earth/North America/United States/Maryland/Baltimore/Johns Hopkins University/Street Name/Building number`

Whereas if the person was already in Baltimore, a relative file path would be:
`Johns Hopkins University/Street Name/Building number`

The end of a path string may be a file name if you are creating a path to a file. If you are creating a path to a folder, the path string will end with the destination folder.

Knowing your location within a file system means knowing exactly what folder you are in right now. The folder that you are in right now is called the `working directory`, aka your "Current Location". In the above analogy, Baltimore would be the person's working directory. In a path, folder names are separated by forward slashes `/`.
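
As a small illustration of the working directory, the sketch below uses two standard shell commands: `pwd`, which prints the working directory, and `cd`, which changes it. The scratch folder is created only for this example and is not part of any project.

```shell
# Print the working directory (your "current location").
pwd

# Create a scratch folder for illustration and move into it.
scratch="$(mktemp -d)"
cd "$scratch"

# The working directory is now the scratch folder.
here="$(pwd)"
echo "$here"
```

Running `pwd` before entering a command that uses a relative path is a quick way to check where that path will start from.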

Note that the working directory may differ between apps: RStudio versus Terminal versus something else. So if you switch between the `Console` and `Terminal` tabs, you will have to pay attention to what your `working directory` is in each. The `Files` pane is separate as well and has no bearing on your working directory. The Terminal tab is located in the Console pane in RStudio, which is usually the lower left pane (with default settings). You can use the terminal to work with files using the command line.

Returning to computer files: in your Terminal, you can see your working directory at the top of the Terminal window or at the beginning of the terminal prompt. Knowing your working directory tells you how to write the path in the command you are entering. Let’s say you want to list, using the `ls` command, a file called `file.txt`.

Contributor: maybe need an image of this

Contributor (@carriewright11, Dec 18, 2024): ok Im adding one... but I feel like we need a tad more walk though of the command line stuff

Contributor Author: That might be its own chapter before we get here?

Contributor: yeah and file paths are so hard for people to understand too - maybe better on it's own as well?

Contributor: want to merge this and then decide if we want to expand later?

Alternatively, you can specify an absolute path. An absolute path starts at the root directory of a file system. The root directory does not have a name like other folders do. It is specified with a single forward slash `/` and is special in that it cannot be contained within other folders. In the current file system, the root directory contains a folder called `cloud`, which contains a folder called `project`, which contains the folder `data_analysis_project`, which has the `file.txt` file we are looking for.
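
The difference between the two kinds of path can be sketched with `ls`. This example is an illustration under assumptions: it rebuilds the `cloud/project/data_analysis_project` folders inside a scratch folder rather than at the real root, so on the file system described above the absolute path would simply be `/cloud/project/data_analysis_project/file.txt`.

```shell
# Build a scratch copy of the folders described above.
root="$(mktemp -d)"
mkdir -p "$root/cloud/project/data_analysis_project"
touch "$root/cloud/project/data_analysis_project/file.txt"

# Relative path: this only works because the working directory is `project`.
cd "$root/cloud/project"
relative_listing="$(ls data_analysis_project/file.txt)"
echo "$relative_listing"

# Absolute path: this works from any working directory, here the root `/`.
cd /
absolute_listing="$(ls "$root/cloud/project/data_analysis_project/file.txt")"
echo "$absolute_listing"
```

The relative path is shorter, but it breaks as soon as the working directory changes; the absolute path is longer, but it keeps working from anywhere on the same computer.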

## Handy R Tools

@@ -57,10 +191,39 @@ https://bookdown.org/ndphillips/YaRrr/projects-in-rstudio.html

http://projecttemplate.net/

### Scientific notebooks (Rmd or qmd)

Using notebooks can be a very helpful tool for documenting the development of an analysis.

Data analyses can lead one on a winding trail of decisions and side investigations, but notebooks allow you to narrate your thought process as you travel along these analytical explorations!

```{r, fig.align='center', echo = FALSE, fig.alt= "Ruby is looking at her computer that has a lovely notebook with a heatmap! Ruby says ‘Working from this notebook allows me to interactively develop on my data analysis and write down my thoughts about the process all in one place!’"}
ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf8f405fdab_0_186")
```

**Your scientific notebook should describe:**

#### The purposes of the notebook

It can be helpful to others and your future self to describe:

- The scientific question you are trying to answer
- The dataset you are using to try to answer this question
- An explanation for the choice of the dataset to help answer this question

#### The rationales behind your decisions

Describe major code decisions. For example, why you chose to use specific packages or why you took certain steps in that specific order. These descriptions can range from very general to very specific, such as why a particular code chunk is doing a particular thing. The more possible options there were for choices or the more unusual a process that you might have taken, the greater the need to describe why you made certain decisions.

Describe any particular filters or cutoffs you are using and how you decided on them.

For data wrangling steps, describe why you are wrangling the data in such a way. Is this because a certain package you are using requires it?

#### Your observations of the results

In this section it is helpful to include:

- What do you currently think about the results?
- What do you think about the plots and tables you show in the notebook -- how do they inform your original questions?

There are two major types of notebooks folks use in the R programming language: [R Markdown files](https://rmarkdown.rstudio.com/articles_intro.html) and [Quarto files](https://quarto.org/docs/get-started/hello/rstudio.html). In the next section we will discuss these notebooks, the similarities and differences between these two options, and how to use them.