From 8812e7eeebeb9483b6e70ec8052c235d9f1fd335 Mon Sep 17 00:00:00 2001
From: cansavvy
Date: Mon, 18 Nov 2024 10:48:36 -0500
Subject: [PATCH 1/8] file-organization
---
 05-setting-up.Rmd | 111 +++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 106 insertions(+), 5 deletions(-)
diff --git a/05-setting-up.Rmd b/05-setting-up.Rmd
index e754452..1c32389 100644
--- a/05-setting-up.Rmd
+++ b/05-setting-up.Rmd
@@ -9,35 +9,136 @@ ottrpal::set_knitr_image_path()
 ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g21a84b32106_0_13")
 ```
+## Understand why project organization is key to reproducible analyses
+Keeping your files organized is a skill that has a high long-term payoff. As you are in the thick of an analysis, you may underestimate how many files and terms you have floating around. But a short time later, you may return to your files and realize your organization was not as clear as you hoped.
-## Understand why project organization is key to reproducible analyses
+@Tayo2019 discusses four particular reasons why it is important to organize your project:
+
+> 1. Organization increases productivity. If a project is well organized, with everything placed in one directory, it makes it easier to avoid wasting time searching for project files such as datasets, codes, output files, and so on.
+> 2. A well-organized project helps you to keep and maintain a record of your ongoing and completed data science projects.
+> 3. Completed data science projects could be used for building future models. If you have to solve a similar problem in the future, you can use the same code with slight modifications.
+> 4. A well-organized project can easily be understood by other data science professionals when shared on platforms such as Github.
+
+Organization is yet another aspect of reproducibility that saves you and your colleagues time!
 ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! example image"}
 ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_1121")
 ```
-## Understand general principles of project organization
+## General principles of project organization
+
+Project organization should work for you and not the other way around. Being maintainably organized should be the goal: something that is effective but can also be maintained long term. As you can imagine, organizational schemes are not one size fits all.
 ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! example image"}
-ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_421")
+ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_426")
 ```
+There are many ways to keep your files organized, and there is no "one size fits all" organizational solution [@Shapiro2021]. In this chapter, we will discuss some generalities, but for specifics we will point you to others who have written about what works for them, and we advise that you use them as inspiration to figure out a strategy that works for you and your team.
+
+The most important aspects of your project organization scheme are that it:
+
+- Is [project-oriented](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/) [@Bryan2017].
+- Follows consistent patterns [@Shapiro2021].
+- Is easy for you and others to find the files you need quickly [@Shapiro2021].
+- Minimizes the likelihood of errors (like writing over files accidentally) [@Shapiro2021].
+- Is something maintainable [@Shapiro2021]!
+
 ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!!
example image"}
-ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_426")
+ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_421")
 ```
+Getting more specific, here are some ideas of how to organize your project:
+
+- **Make file names informative** to those who don't have knowledge of the project but avoid using spaces, quotes, or unusual characters in your filenames and folders -- these only serve to make reading in files a nightmare in some programs.
+- **Number scripts** in the order that they are run.
+- **Keep like-files together** in their own directory: results tables with other results tables, etc. _Most importantly, keep raw data separate from processed data or other results!_
+- **Put source scripts and functions in their own directory**. These are things that should never need to be called directly by you or anyone else.
+- **Put output in its own directories** like `results` and `plots`.
+- **Have a central document (like a README)** that describes the basic information about the analysis and how to re-run it.
+- Make it easy on yourself: **dates aren't necessary**. The computer keeps track of those.
+- **Make a central script that re-runs everything** -- including the creation of the folders! (more on this in a later chapter)
+
+Let's see what these principles might look like put into practice.
+
 ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!!
example image"}
 ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_442")
 ```
-https://github.com/jhudsl/reproducible-r-example
+Here's an example of what this might look like:
+```
+project-name/
+├── run_analysis.sh
+├── 00-download-data.sh
+├── 01-make-heatmap.Rmd
+├── README.md
+├── plots/
+│   └── project-name-heatmap.png
+├── results/
+│   └── top_gene_results.tsv
+├── raw-data/
+│   ├── project-name-raw.tsv
+│   └── project-name-metadata.tsv
+├── processed-data/
+│   └── project-name-quantile-normalized.tsv
+└── util/
+    ├── plotting-functions.R
+    └── data-wrangling-functions.R
+```
+
+**What these hypothetical files and folders contain:**
+
+- `run_analysis.sh` - A central script that runs everything again
+- `00-download-data.sh` - The script that needs to be run first and is called by `run_analysis.sh`
+- `01-make-heatmap.Rmd` - The script that needs to be run second and is also called by `run_analysis.sh`
+- `README.md` - The document that has the information that will orient someone to this project. We'll discuss more about how to create a helpful README in [an upcoming chapter](https://jhudatascience.org/Reproducibility_in_Cancer_Informatics/documenting-analyses.html#readmes).
+- `plots` - A folder of plots and resulting images
+- `results` - A folder of results
+- `raw-data` - Data files as they first arrive, when **nothing** has been done to them yet.
+- `processed-data` - Data that has been modified from the raw in some way.
+- `util` - A folder of utilities that never needs to be called or touched directly unless troubleshooting something
+
+There are lots of ideas out there for organizational strategies. The key is finding one that fits your team and your project.
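As a minimal sketch of how the folders in the example layout above could be created from R: the folder names come from the example tree, while the `project_root` variable and the use of `tempdir()` are assumptions made purely for illustration, so adapt them to your own project.

```r
# Hypothetical project location: a temporary folder, so nothing real is touched.
project_root <- file.path(tempdir(), "project-name")

# Folder names copied from the example layout.
folders <- c("plots", "results", "raw-data", "processed-data", "util")

# Create each folder; recursive = TRUE also creates project_root if needed.
for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List the folders that now exist directly under the project root.
created <- list.dirs(project_root, full.names = FALSE, recursive = FALSE)
print(sort(created))
```

Because `showWarnings = FALSE` is set, re-running the sketch does not complain when the folders already exist, which is the behavior you would likely want from a script that re-runs everything.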
You can read through some of these articles to think about what kind of organizational strategy might work for you and your team:
+
+- [Reproducible R example](https://github.com/jhudsl/reproducible-r-example)
+- [Jenny Bryan's organizational strategies](https://www.stat.ubc.ca/~jenny/STAT545A/block19_codeFormattingOrganization.html) [@Bryan2021].
+- [Danielle Navarro's organizational strategies](https://www.youtube.com/playlist?list=PLRPB0ZzEYegPiBteC2dRn95TX9YefYFyy) [@Navarro2021].
+- [Jenny Bryan on Project-oriented workflows](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/) [@Bryan2017].
+- [Data Carpentry mini-course about organizing projects](https://datacarpentry.org/organization-genomics/) [@DataCarpentry2021].
+- [Andrew Severin's strategy for organization](https://bioinformaticsworkbook.org/projectManagement/Intro_projectManagement.html#gsc.tab=0) [@Severin2021].
+- [A BioStars thread where many individuals share their own organizational strategies](https://www.biostars.org/p/821/) [@Biostars2021].
+- [Data Carpentry course chapter about getting organized](https://bioinformatics-core-shared-training.github.io/shell-genomics/07-organization/index.html) [@DataCarpentry2019].
 ## Navigate file paths
+In point and click apps (called Graphical User Interfaces) you navigate to files by clicking on folders. But for R programming and other command line interfaces, we navigate and use files by using `file paths`. A file path is the series of folders it takes to get to a file, not unlike a street address.
+
+To make an analogy, if someone asked you directions to a particular building, the directions you give would be tailored based on where this person is located. In other words, your directions would be relative to their location.
+
+But file paths can be *relative* or *absolute*.
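A small R sketch may help make the relative versus absolute distinction concrete. It assumes a made-up folder `data_analysis_project` holding an empty `file.txt`, both created in a temporary location purely for illustration. The folder R is currently working in, reported by `getwd()` and changed with `setwd()`, is the starting point for any relative path.

```r
# Hypothetical folder and file, created in a temporary location for illustration.
project_dir <- file.path(tempdir(), "data_analysis_project")
dir.create(project_dir, showWarnings = FALSE)
file.create(file.path(project_dir, "file.txt"))

# Move into the project folder, remembering where we started.
old_wd <- setwd(project_dir)

# A relative path is resolved starting from the current folder.
relative_path <- "file.txt"
found_relative <- file.exists(relative_path)

# An absolute path spells out every folder, starting from the root.
absolute_path <- normalizePath(relative_path)

# Go back to the folder we started in.
setwd(old_wd)

# The absolute path still points at the file, wherever we are now.
found_absolute <- file.exists(absolute_path)

print(c(relative = found_relative, absolute = found_absolute))
```

The relative path only works while R is located in the project folder, whereas the absolute path keeps working after moving elsewhere; that trade-off is why it pays to know where you currently are before typing a path.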
+
+In the same way, your computer can be given absolute directions to a file, which spell out the full route from the very top of the file system, or directions that are relative to where you are calling the command from on the computer.
+
 ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! example image"}
 ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_1337")
 ```
+So in our above analogy, if you are trying to direct someone to somewhere on Johns Hopkins’ campus with a file path:
+
+An absolute file path would be:
+`/Earth/North America/United States/Maryland/Baltimore/Johns Hopkins University/Street Name/Building number`
+
+Whereas if the person was already in Baltimore, a relative file path would be:
+`Johns Hopkins University/Street Name/Building number`
+
+The end of a path string may be a file name if you are creating a path to a file. If you are creating a path to a folder, the path string will end with the destination folder.
+
+To know your location within a file system is to know exactly what folder you are in right now. The folder that you are in right now is called the `working directory`, aka your "Current Location". In the above analogy, a person being located in Baltimore would make Baltimore their working directory. In a path, folder names are separated by forward slashes `/`.
+
+Note that the working directory may be different between different apps: RStudio versus Terminal versus something else. So if you switch between the `Console` and `Terminal` tabs, you will have to pay attention to what your `working directory` is. This is also different from the `Files` pane, which has no bearing on your working directory.
+
+Returning to computer files: in your Terminal you can see your working directory at the top of the Terminal window or at the beginning of the terminal prompt. Knowing this can tell you how you need to change the command you are entering.
Let’s say you want to list, using the `ls` command, a file called `file.txt`.
+
+Alternatively, you can specify an absolute path. An absolute path starts at the root directory of a file system. The root directory does not have a name like other folders do. It is specified with a single forward slash `/` and is special in that it cannot be contained within other folders. In the current file system, the root directory contains a folder called `cloud`, which contains a folder called `project`, which contains the folder `data_analysis_project`, which has the `file.txt` file we are looking for.
 ## Handy R Tools
From 99440832f95bf416088ba6c745d719cdc4c60724 Mon Sep 17 00:00:00 2001
From: Candace Savonen
Date: Wed, 20 Nov 2024 09:32:27 -0500
Subject: [PATCH 2/8] Adding stuff about notebooks
---
 05-setting-up.Rmd | 57 +++++++++++++++++++++++++++++++++++++++---
 06-rmarkdown.Rmd  | 63 +++++++++++++++++++++++------------------------
 2 files changed, 84 insertions(+), 36 deletions(-)
diff --git a/05-setting-up.Rmd b/05-setting-up.Rmd
index 1c32389..edde15f 100644
--- a/05-setting-up.Rmd
+++ b/05-setting-up.Rmd
@@ -48,6 +48,33 @@ The most important aspects of your project organization scheme are that it:
 ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_421")
 ```
+### READMEs!
+
+READMEs are also a great way to help your collaborators get quickly acquainted with the project.
+
+```{r, fig.align='center', echo = FALSE, fig.alt= "Avi is looking at a set of project files that include a file called a ‘README.md’.
Avi says 'I had no idea where to start with this analysis that Ruby sent me to review, but then I saw she included a README and that saved me so much time and effort in getting started!'"}
+ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf8379bb805_0_11")
+```
+
+READMEs stick out in a project and are generally a universal signal for new people to the project to start by READing them. GitHub will automatically preview your file called "README.md" when someone comes to the main page of your repository, which further encourages people looking at your project to read the information in your README.
+
+**Information that should be included in a README:**
+
+1) General purpose of the project
+2) Instructions on how to re-run the project
+3) Lists of any software required by the project
+4) Input and output file descriptions
+5) Descriptions of any additional tools included in the project
+
+You can take a look at this [template README](https://raw.githubusercontent.com/jhudsl/Reproducibility_in_Cancer_Informatics/main/resources/README-template.md) to get you started.
+
+#### More about writing READMEs:
+
+- [How to write a good README file](https://www.freecodecamp.org/news/how-to-write-a-good-readme-file/)
+- [How to write an awesome README](https://towardsdatascience.com/how-to-write-an-awesome-readme-68bf4be91f8b)
+
+### Example organization scheme
+
 Getting more specific, here are some ideas of how to organize your project:
 - **Make file names informative** to those who don't have knowledge of the project but avoid using spaces, quotes, or unusual characters in your filenames and folders -- these only serve to make reading in files a nightmare in some programs.
@@ -158,10 +185,32 @@ https://bookdown.org/ndphillips/YaRrr/projects-in-rstudio.html
 http://projecttemplate.net/
-### R Markdown files
+### Scientific notebooks (Rmd or qmd)
+
+Keeping notebooks, and using them generously, is a useful way to document the development of an analysis.
+
+Data analyses can lead one on a winding trail of decisions and side investigations, but notebooks allow you to narrate your thought process as you travel along these analysis explorations!
+
+```{r, fig.align='center', echo = FALSE, fig.alt= "Ruby is looking at her computer that has a lovely notebook with a heatmap! Ruby says ‘Working from this notebook allows me to interactively develop on my data analysis and write down my thoughts about the process all in one place!’"}
+ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf8f405fdab_0_186")
+```
+
+**Your scientific notebook should describe:**
+
+#### The purposes of the notebook
+
+What scientific question are you trying to answer? Describe the dataset you are using to try to answer this and why it helps answer this question.
+
+#### The rationales behind your decisions
+
+Describe why a particular code chunk is doing a particular thing -- the odder the code looks, the greater the need for you to describe why you are doing it.
+
+Describe any particular filters or cutoffs you are using and how you decided on those.
+
+For data wrangling steps, why are you wrangling the data in such a way -- is this because a certain package you are using requires it?
-https://rmarkdown.rstudio.com/articles_intro.html
+#### Your observations of the results
-### Quarto files
+What do you think about the results? The plots and tables you show in the notebook -- how do they inform your original questions?
-https://quarto.org/docs/get-started/hello/rstudio.html
+There are two major types of notebooks folks use in the R programming language: R Markdown files and Quarto files. In the next section we will discuss these notebooks, how they are the same, how they are different, and how to use them.
diff --git a/06-rmarkdown.Rmd b/06-rmarkdown.Rmd
index 4fd42a3..cb85ea5 100644
--- a/06-rmarkdown.Rmd
+++ b/06-rmarkdown.Rmd
@@ -10,13 +10,13 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_
 ## Reports support reproducibility
-Using R Markdown files helps you to create supports that can more transparently show what you did for your analysis and it can help you to test that your code works as expected. Scripts allow you to save code, but they do not allow you to have the following additional benefits.
+Using notebooks helps you to create reports that can more transparently show what you did for your analysis, and it can help you to test that your code works as expected. Scripts allow you to save code, but they do not allow you to have the following additional benefits.
 The following are reasons why R Markdown files help reproducibility:
 - They allow you to show and share your code and the output of your code in one place! (this can be done in several ways depending on what you want)
 - They allow you to test if your code works outside of what is active in your environment
-- They allow you to test sections and all previous sections of your code out to troubleshoot
+- They allow you to test sections and all previous sections of your code out to troubleshoot
 - They help you understand what might be wrong with your code in smaller sections of code if you have an issue
 ## Getting Started with R Markdown
@@ -46,7 +46,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_
 This pane is where we can write code to save in our R Markdown report.
-Thus the lower left pane is where we can test out code (although we do not generally recommend it), but the top pane is where we can write code that we wish to save. +Thus the lower left pane is where we can test out code (although we do not generally recommend it), but the top pane is where we can write code that we wish to save. Note that you can also test selected code (or a current line) in an R Markdown file using a keyboard shortcut of Ctrl+Enter on Windows & Linux computers or Cmd+Return on Mac computers. @@ -77,10 +77,10 @@ There is a special `Knit` button that looks like a ball of yarn with a knitting ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "The Knit button at the top of the R Markdown file allows us to create a nice report from the file."} ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g1fa1583c827_0_16") ``` - + You will likely be prompted to give the file a name after you press the Knit button and to confirm where you want to save the rendered version. -You will then see in a second or two (after some information is printed on the Render tab in the lower left pane) a screen pop up with the rendered version of the report. +You will then see in a second or two (after some information is printed on the Render tab in the lower left pane) a screen pop up with the rendered version of the report. This will look something like this: @@ -102,7 +102,7 @@ Hopefully you can already start to appreciate how useful it can be to send peopl ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g229ab7a949e_0_39") ``` -It's important to note that when we knit an R Markdown file, it will test our code as if we have an empty environment and it will rely on **only the code written in the R Markdown file**. 
It can't use code that was tested in the Console or run interactively in the R Markdown file (more on that soon). +It's important to note that when we knit an R Markdown file, it will test our code as if we have an empty environment and it will rely on **only the code written in the R Markdown file**. It can't use code that was tested in the Console or run interactively in the R Markdown file (more on that soon). This process really helps with reproducibility because it helps us make sure that all the instructions needed (loading packages, assigning objects, etc) are within the code that we saved in the R Markdown file. @@ -116,7 +116,7 @@ Now let's discuss how to start writing code in such a file. At the top of an R Markdown file you will see some special code that is called [YAML](https://en.wikipedia.org/wiki/YAML) code. It is commonly used to configure programming projects. It does the same for our R Markdown reports. A major difference between R and YAML is that spacing really matters for YAML. -What do we mean by configure? Configuration in programming generally refers to setting things up. +What do we mean by configure? Configuration in programming generally refers to setting things up.
@@ -133,7 +133,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_ ``` -You can modify the `"Untitled"` text after `title:` to specify the title of your report. If you want to change the author section where it says `"your name"` in the example if that was not by default want you wanted in your file. +You can modify the `"Untitled"` text after `title:` to specify the title of your report. If you want to change the author section where it says `"your name"` in the example if that was not by default want you wanted in your file. ### Code chunks @@ -145,13 +145,13 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_ The notation here means the following: -- The three back ticks `"```"` indicate the barriers of where some code will be in between. This is what we call an R chunk. +- The three back ticks `"```"` indicate the barriers of where some code will be in between. This is what we call an R chunk. - The `{r}` indicates that we are going to write the code using R code. - Extra information can be added inside the curly bracket `{}` notation to give the chunk a name, in this case it is called `setup`. - The `include = FALSE` means that it will not show up in the rendered report. -This first chunk tells the document how additional chunks should show up in the rendered report by default. Here it says that code should show up with `echo = TRUE` in the report. You don't need to worry too much about any of this now, just recognize that this is a chunk of code. +This first chunk tells the document how additional chunks should show up in the rendered report by default. Here it says that code should show up with `echo = TRUE` in the report. You don't need to worry too much about any of this now, just recognize that this is a chunk of code. As we scroll past some text within the R Markdown file, we will see another chunk. 
@@ -190,7 +190,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_
-Writing our code in chunks (as opposed to one long script) can help with reproducibility, as we can better determine where possible changes may have occurred and how that influenced the results in a step-wise fashion, instead of just one final output. +Writing our code in chunks (as opposed to one long script) can help with reproducibility, as we can better determine where possible changes may have occurred and how that influenced the results in a step-wise fashion, instead of just one final output. ### Running previous chunks @@ -201,7 +201,7 @@ You may also notice that there is another button to the left of the play button. ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g229ab7a949e_0_297") ``` -This is super helpful for reproducibility in terms of making sure that your code isn't working simply because you have something in your environment from code that you tested in the console but did not save. +This is super helpful for reproducibility in terms of making sure that your code isn't working simply because you have something in your environment from code that you tested in the console but did not save. Issues can happen if you run a code chunk out of order or change the code in a chunk after running it previously. This can make you think that you have all the code that you need saved to obtain the result that you found, when in fact you do not. @@ -209,7 +209,7 @@ Therefore we recommend cleaning the environment (which we will describe in the n ## Cleaning the environment -We suggest cleaning out your environment somewhat regularly when you are interactively testing your R Markdown file using chunks. To do so, you can press the button that looks like a broom in the upper right pane. +We suggest cleaning out your environment somewhat regularly when you are interactively testing your R Markdown file using chunks. 
To do so, you can press the button that looks like a broom in the upper right pane. ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Image showing the location of the broom button to clean the environment."} ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g229ab7a949e_0_305") @@ -219,14 +219,14 @@ The ultimate test though is to press the `Knit` button and make sure you have al ## Restarting R Session -To really test your code, every once in a while, we suggest restarting your R Session and trying to Knit your R Markdown file to make sure that anything you loaded during your previous session (but didn't save in your code) wasn't allowing your code to run successfully. +To really test your code, every once in a while, we suggest restarting your R Session and trying to Knit your R Markdown file to make sure that anything you loaded during your previous session (but didn't save in your code) wasn't allowing your code to run successfully. -To do so, you can click on the `Session` tab of the upper menu of RStudio and click `Restart R`. +To do so, you can click on the `Session` tab of the upper menu of RStudio and click `Restart R`. ## Chunk setup -You may find that sometimes you want to hide the code in a report, or hide the output. This can be for a variety of reasons. For example, the first chunk that is in every new R Markdown file (when you first open one) is hidden. This is because it sets up how all the other chunks work (by default) and it isn't really important for the analysis. Recall that we hide the code and any output, using `include = FALSE`. If we just want to hide one or the other we can use different specifications. +You may find that sometimes you want to hide the code in a report, or hide the output. This can be for a variety of reasons. For example, the first chunk that is in every new R Markdown file (when you first open one) is hidden. 
This is because it sets up how all the other chunks work (by default) and it isn't really important for the analysis. Recall that we hide the code and any output, using `include = FALSE`. If we just want to hide one or the other we can use different specifications. The easiest way to do this is to click on the little gear symbol for the R chunk you wish to modify. @@ -244,11 +244,11 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_ ``` -For reproducibility purposes, we generally suggest that you share the code, however, sometimes reports can get very difficult to read if you have all the code shown. So there are times where you might focus on a particular part of an analysis. We will also describe a nifty trick to allow readers of your report to see the code if they want to, but have it hidden most of the time. +For reproducibility purposes, we generally suggest that you share the code, however, sometimes reports can get very difficult to read if you have all the code shown. So there are times where you might focus on a particular part of an analysis. We will also describe a nifty trick to allow readers of your report to see the code if they want to, but have it hidden most of the time. ## Finding chunks -If your R Markdown file gets really long, it can be difficult to scroll to find the chunk you want to modify. If you name your chunks, or even if you don't, you can more easily move around from one chunk to another using a special menu button created just for this! +If your R Markdown file gets really long, it can be difficult to scroll to find the chunk you want to modify. If you name your chunks, or even if you don't, you can more easily move around from one chunk to another using a special menu button created just for this! There is a very small menu at the bottom of the R Markdown file editor that helps you move around. 
It will look slightly different depending on what your chunks are named, but will have a gold hashtag button. @@ -259,7 +259,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_ ## Add chunks -To add new chunks you can either click on the chunk button on the top right of the R Markdown editor, which looks like a green square with a "C" in it and a plus sign on the corner. +To add new chunks you can either click on the chunk button on the top right of the R Markdown editor, which looks like a green square with a "C" in it and a plus sign on the corner. ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "The button to add new chunks is located on the upper right corner of the R Markdown Editor. It looks like a green square with a C in it. "} ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g229ab7a949e_0_345") @@ -288,9 +288,9 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_ ## Text and headers -You will notice that there is text written around the code chunks that you can use to describe what you did in your analysis and why. +You will notice that there is text written around the code chunks that you can use to describe what you did in your analysis and why. -There are a couple of formatting options that can be very useful to know. +There are a couple of formatting options that can be very useful to know. If you want to know more, you can check out this [guide](https://www.markdownguide.org/) about Markdown in general. The syntax will be the same for R Markdown files too. @@ -305,9 +305,9 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_ ### Bold and Italics -Bold text can be created using `**` around the text. +Bold text can be created using `**` around the text. -Italic text can be created using `*` around the text. +Italic text can be created using `*` around the text. 
To do both you can use `***` around the text. @@ -331,9 +331,9 @@ We also recommend checking out the [R Markdown cookbook](https://bookdown.org/yi ### Aesthetics -Sometimes we might want to make our reports look a little nicer, perhaps we want to add branding that matches that of our institute or at least makes the report look really polished. +Sometimes we might want to make our reports look a little nicer, perhaps we want to add branding that matches that of our institute or at least makes the report look really polished. -You can make changes to the aesthetics of the report in very few steps. +You can make changes to the aesthetics of the report in very few steps. First locate the settings button for the R Markdown editor, which looks like a gear an is located next to the `Knit` button. @@ -347,7 +347,7 @@ Then scroll down and select "Output Options". This menu also has nice features i ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g229ab7a949e_0_463") ``` -This will open a new window that has a dropdown that you can use to apply a theme to the report. +This will open a new window that has a dropdown that you can use to apply a theme to the report. ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Use the dropdown menu next to the Apply theme section to change the theme of your R Markdown report."} ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g229ab7a949e_0_376 @@ -384,7 +384,7 @@ If you like to work with keyboard shortcuts instead of pointing and clicking, yo ### Table of Contents -Sometimes if your report is very long, it can help to add a table of contents. +Sometimes if your report is very long, it can help to add a table of contents. This can be done by adding `toc: true` and `toc_float: true` to the YAML underneath the `html_document:` code. The spacing is very important with this! 
The `toc_float: true` makes the table of contents on the side as opposed to just the top. @@ -395,7 +395,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_ ### Code Folding -Earlier we talked about hiding code but discussed that usually you want to share the code if possible. Code folding is really great option for this issue! It allows you to create a clean report with a button for people to click to see the code within the code chunk that resulted in the various outputs of the report. +Earlier we talked about hiding code but discussed that usually you want to share the code if possible. Code folding is really great option for this issue! It allows you to create a clean report with a button for people to click to see the code within the code chunk that resulted in the various outputs of the report. To do this you can add `code_folding: 'hide'` to cause your code to be "folded". @@ -403,7 +403,7 @@ To do this you can add `code_folding: 'hide'` to cause your code to be "folded" ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g229ab7a949e_0_521") ``` -This means that there will be a button that people can click on to see the code (or hide it afterwards). +This means that there will be a button that people can click on to see the code (or hide it afterwards). ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Code folding allows others to click on a code button to show the code, they can then click hide to hide it afterwards."} @@ -433,11 +433,11 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_ ``` -This trick is great for reproducibility because it ensures that the date on the report is correct for when the report was last rendered. This helps those who read the report to get a sense of how active development is on the project. 
+This trick is great for reproducibility because it ensures that the date on the report is correct for when the report was last rendered. This helps those who read the report to get a sense of how active development is on the project. ## Conclusion -In summary, R Markdown files can help you to create nice looking reports that help others to understand not only what code you used, but also what the results of your code were. +In summary, R Markdown files can help you to create nice looking reports that help others to understand not only what code you used, but also what the results of your code were. - Code is written in gray sections called chunks that have play buttons that allow you to preview the code - The Knit button allows you to render the full report and test that all of the needed code is in the file @@ -450,4 +450,3 @@ In summary, R Markdown files can help you to create nice looking reports that he - hashtags are used to create headers, the fewer the hashtags the larger the header - Asterisk around text creates bold or italic font - There are additional features to make your R Markdown report showcase your code and the output of your code in more readable ways, including adding a table of contents or folding code, so that readers can click to see the code that created a particular output. This is a really great option for reproducibility because it creates easy to read reports but also shares your code! 
- From 455ff9c4950981369ebaca18e17da3dbfa95fd3f Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 20 Nov 2024 09:50:18 -0500 Subject: [PATCH 3/8] R markdown description --- 06-rmarkdown.Rmd | 34 ++++++++++++++++++++++++++++++---- 1 file changed, 30 insertions(+), 4 deletions(-) diff --git a/06-rmarkdown.Rmd b/06-rmarkdown.Rmd index cb85ea5..d13dc57 100644 --- a/06-rmarkdown.Rmd +++ b/06-rmarkdown.Rmd @@ -10,7 +10,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_ ## Reports support reproducibility -Using notebooks help you to create supports that can more transparently show what you did for your analysis and it can help you to test that your code works as expected. Scripts allow you to save code, but they do not allow you to have the following additional benefits. +Using notebooks help you to create supports that can more transparently show what you did for your analysis and it can help you to test that your code works as expected. Scripts allow you to save code, but they do not allow you to have the following additional benefits. The following are reasons why R Markdown files help reproducibility: @@ -19,9 +19,33 @@ The following are reasons why R Markdown files help reproducibility: - They allow you to test sections and all previous sections of your code out to troubleshoot - They help you understand what might be wrong with your code in smaller sections of code if you have an issue -## Getting Started with R Markdown +## R Markdown or Quarto? -OK, so now we know how to make a new R Markdown file from the previous section. +Both R Markdown and Quarto are types of notebooks that have similar functions. R Markdown files end with the suffix `.Rmd` while quarto files end with `.qmd`. + +Both Qmd and Rmd files are both notebooks that have the benefits we've described above. They allow you to document using the markdown language. 
Plus, because they are so similar you can often just change the suffix of your file and convert between these file types (results may vary depending on the content of the file). + +R Markdown was the first R programming notebook on the scene, and has a lot of tools devoted to it because it has been around awhile. In 2022, [Posit released the Quarto notebook](https://posit.co/blog/announcing-quarto-a-new-scientific-and-technical-publishing-system/). So Quarto has a lot of great new features but is still relatively new. + +Posit created Quarto with the idea of streamlining document making and allowing for more compatibility with languages that are not R. While R Markdown documents also (sort of) allow for other languages to be run in them, their ability to do this successfully is limited. + +### R Markdown Pros: +- Time tested, a lot of packages and resources built for it. +- Fundamentally an R notebook and is built around that. + +### R Markdown Cons: +- Does not always do well running other languages (like Python). +- Does require a lot of extra packages to be installed to do more things with it: `bookdown`, `distill`, etc. + +### Quarto Pros: +- Built with more compatability for other languages +- Appears to be more streamlined/centralized and less need for a lot of extra packages to create other types of documents. + +### Quarto Cons: +- It is still quite new, and the community is still catching up to it. Though it appears to be built with backwards compatibility in mind. +- Because it is so new, there are still features they are continuing to build for it that R Markdown has that it may not just yet. At this point, these are mostly very fringe features, but it could be something an individual user is interested in. + +## Getting Started with notebooks
Click here for a review on how to create R Markdown files in RStudio. @@ -69,7 +93,9 @@ Once open the file your RStudio should look something like this: ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g229ab7a949e_0_0") ``` -## Rendering R Markdown files +## Rendering R Markdown + +For this first chapter we will introduce you to R Markdown files, but note there are many [great and continually new emerging tutorials to introduce to Quarto notebooks](https://quarto.org/docs/guide/). Most of what we discuss about R Markdown files is also applicable to Quarto and you can often just switch the suffix of your file and have *most* of your features and code still work. There is a special `Knit` button that looks like a ball of yarn with a knitting needle at the top of the R Markdown files that helps you create your report. Since R Markdown files by default have some code, we can press this to see what a rendered report might look like before we start writing our own code. From a014c0f2e70d42520a21dc2fd90f2aee20b059bc Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Wed, 20 Nov 2024 09:54:18 -0500 Subject: [PATCH 4/8] Spelling fixes --- 06-rmarkdown.Rmd | 4 ++-- resources/dictionary.txt | 8 ++++++++ 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/06-rmarkdown.Rmd b/06-rmarkdown.Rmd index d13dc57..48bf055 100644 --- a/06-rmarkdown.Rmd +++ b/06-rmarkdown.Rmd @@ -38,7 +38,7 @@ Posit created Quarto with the idea of streamlining document making and allowing - Does require a lot of extra packages to be installed to do more things with it: `bookdown`, `distill`, etc. ### Quarto Pros: -- Built with more compatability for other languages +- Built with more compatibility for other languages - Appears to be more streamlined/centralized and less need for a lot of extra packages to create other types of documents. 
### Quarto Cons: @@ -95,7 +95,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_ ## Rendering R Markdown -For this first chapter we will introduce you to R Markdown files, but note there are many [great and continually new emerging tutorials to introduce to Quarto notebooks](https://quarto.org/docs/guide/). Most of what we discuss about R Markdown files is also applicable to Quarto and you can often just switch the suffix of your file and have *most* of your features and code still work. +For this first chapter we will introduce you to R Markdown files, but note there are many [great and continually new emerging tutorials to introduce to Quarto notebooks](https://quarto.org/docs/guide/). Most of what we discuss about R Markdown files is also applicable to Quarto and you can often just switch the suffix of your file and have *most* of your features and code still work. There is a special `Knit` button that looks like a ball of yarn with a knitting needle at the top of the R Markdown files that helps you create your report. Since R Markdown files by default have some code, we can press this to see what a rendered report might look like before we start writing our own code. 
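For readers who prefer the command line to the `Knit` button, rendering can also be scripted. The sketch below assumes the `rmarkdown` package is installed and uses a placeholder notebook name (`01-make-heatmap.Rmd`, borrowed from the example project layout). `rmarkdown::render()` is the R function that performs the rendering. If `Rscript` or the file is not available, the script only prints the command it would have run.

```shell
#!/bin/sh
# Sketch: render an R Markdown notebook without clicking the Knit button.
# The notebook name below is a placeholder (an assumption); use your own file.
notebook="01-make-heatmap.Rmd"

# R expression that renders the notebook to its default output format.
render_cmd="rmarkdown::render('${notebook}')"

if command -v Rscript >/dev/null 2>&1 && [ -f "$notebook" ]; then
  # Rscript and the notebook are both available, so render the report.
  Rscript -e "$render_cmd"
else
  # Otherwise, just show what would be run.
  echo "Would run: Rscript -e \"$render_cmd\""
fi
```

A command like this is one possible building block for a central script that re-runs an entire analysis.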
diff --git a/resources/dictionary.txt b/resources/dictionary.txt index 5b482a3..59ffc63 100644 --- a/resources/dictionary.txt +++ b/resources/dictionary.txt @@ -1,3 +1,11 @@ +Maintainably +Hopkins +BioStars +READing +Qmd +qmd +Rmd +Severin's chatbots ChatBots ChatGPT From a18f9a2aeb911322b8df36ac065b9608839157cf Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Tue, 10 Dec 2024 11:49:52 -0500 Subject: [PATCH 5/8] Applying Carrie suggestions part 1 Co-authored-by: Carrie Wright <23014755+carriewright11@users.noreply.github.com> --- 05-setting-up.Rmd | 30 ++++++++++++++++++------------ 1 file changed, 18 insertions(+), 12 deletions(-) diff --git a/05-setting-up.Rmd b/05-setting-up.Rmd index edde15f..200f2ea 100644 --- a/05-setting-up.Rmd +++ b/05-setting-up.Rmd @@ -15,10 +15,10 @@ Keeping your files organized is a skill that has a high long-term payoff. As you @Tayo2019 discusses four particular reasons why it is important to organize your project: -> 1. Organization increases productivity. If a project is well organized, with everything placed in one directory, it makes it easier to avoid wasting time searching for project files such as datasets, codes, output files, and so on. -> 2. A well-organized project helps you to keep and maintain a record of your ongoing and completed data science projects. -> 3. Completed data science projects could be used for building future models. If you have to solve a similar problem in the future, you can use the same code with slight modifications. -> 4. A well-organized project can easily be understood by other data science professionals when shared on platforms such as Github. +> 1. Organization **increases productivity**. If a project is well organized, with everything placed in one directory, it makes it easier to avoid wasting time searching for project files such as datasets, codes, output files, and so on. +> 2. 
A well-organized project helps you to keep and **maintain a record** of your ongoing and completed data science projects. +> 3. Completed data science projects could be used for **building future models**. If you have to solve a similar problem in the future, you can use the same code with slight modifications. +> 4. A well-organized project can **easily be understood** by other data science professionals when shared on platforms such as Github. Organization is yet another aspect of reproducibility that saves you and your colleagues time! @@ -28,13 +28,13 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_ ## General principles of project organization -Project organization should work for you and not the other way around. Maintainably organized should be the goal. Something that is effective but can be maintained long term. As you can imagine, organizational schemes are not one size fits all. +Project organization should work for you and not the other way around. The goal should be organization that is maintainable long term. As you might imagine, the optimal organizational scheme might differ from one individual to another or even one project to another. ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! example image"} ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_426") ``` -There's a lot of ways to keep your files organized, and there's not a "one size fits all" organizational solution [@Shapiro2021]. In this chapter, we will discuss some generalities but as far as specifics we will point you to others who have written about works for them and advise that you use them as inspiration to figure out a strategy that works for you and your team. +There's a lot of ways to keep your files organized, and there's not a "one size fits all" organizational solution [@Shapiro2021]. 
In this chapter, we will discuss some generalities; but for specifics, we will point you to others who have written about what works for them. We suggest that you use them as inspiration to figure out a strategy that works for you and your team. The most important aspects of your project organization scheme is that it: @@ -56,28 +56,34 @@ READMEs are also a great way to help your collaborators get quickly acquainted w ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF6KS9QmJ5NBjwWVjRn40MS9f2noE/edit#slide=id.gf8379bb805_0_11") ``` -READMEs stick out in a project and are generally universal signal for new people to the project to start by READing them. GitHub automatically will preview your file called "README.md" when someone comes to the main page of your repository which further encourages people looking at your project to read the information in your README. +READMEs stick out in a project and are generally universal signal for new people to the project to start by READing them. GitHub automatically will preview your file called "README.md" when someone comes to the main page of your repository. This further encourages people looking at your project to read the information in your README. **Information that should be included in a README:** 1) General purpose of the project 2) Instructions on how to re-run the project 3) Lists of any software required by the project -4) Input and output file descriptions. -5) Descriptions of any additional tools included in the project? +4) Input and output file descriptions +5) Descriptions of any additional tools included in the project You can take a look at this [template README](https://raw.githubusercontent.com/jhudsl/Reproducibility_in_Cancer_Informatics/main/resources/README-template.md) to get your started. 
#### More about writing READMEs: -- [How to write a good README file](https://www.freecodecamp.org/news/how-to-write-a-good-readme-file/) -- [How to write an awesome README](https://towardsdatascience.com/how-to-write-an-awesome-readme-68bf4be91f8b) +- [How to write a good README file by Hillary Nyakundi](https://www.freecodecamp.org/news/how-to-write-a-good-readme-file/) +- [How to write an awesome README by Navendu Pottekkat](https://towardsdatascience.com/how-to-write-an-awesome-readme-68bf4be91f8b) + +#### Examples of good READMEs: + +- https://github.com/stephaniehicks/qsmooth +- https://github.com/lcolladotor/derfinder +- https://github.com/tidyverse/dplyr ### Example organization scheme Getting more specific, here's some ideas of how to organize your project: -- **Make file names informative** to those who don't have knowledge of the project but avoid using spaces, quotes, or unusual characters in your filenames and folders -- these only serve to make reading in files a nightmare in some programs. +- **Make file names informative** to those who don't have knowledge of the project -- but avoid using spaces, quotes, or unusual characters in your filenames and folders, as these can make reading in files a nightmare with some programs. - **Number scripts** in the order that they are run. - **Keep like-files together** in their own directory: results tables with other results tables, etc. _Including most importantly keeping raw data separate from processed data or other results!_ - **Put source scripts and functions in their own directory**. Things that should never need to be called directly by yourself or anyone else. 
From b664e1847a5ec44fd3dd37a246a1ec77dfa84b34 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Tue, 10 Dec 2024 11:52:49 -0500 Subject: [PATCH 6/8] Incorporating Carrie Suggestions part 2 Co-authored-by: Carrie Wright <23014755+carriewright11@users.noreply.github.com> --- 05-setting-up.Rmd | 43 +++++++++++++++++++++++++------------------ 06-rmarkdown.Rmd | 10 +++++----- 2 files changed, 30 insertions(+), 23 deletions(-) diff --git a/05-setting-up.Rmd b/05-setting-up.Rmd index 200f2ea..d4e33e7 100644 --- a/05-setting-up.Rmd +++ b/05-setting-up.Rmd @@ -92,7 +92,7 @@ Getting more specific, here's some ideas of how to organize your project: - Make it easy on yourself, **dates aren't necessary**. The computer keeps track of those. - **Make a central script that re-runs everything** -- including the creation of the folders! (more on this in a later chapter) -Let's see what these principles might look like put into practice. +Let's see what these principles might look in practice. ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! example image"} ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_442") @@ -121,17 +121,17 @@ project-name/ **What these hypothetical files and folders contain:** -- `run_analysis.sh` - A central script that runs everything again +- `run_analysis.sh` - A central script that runs everything - `00-download-data.sh` - The script that needs to be run first and is called by run_analysis.sh - `01-make-heatmap.Rmd` - The script that needs to be run second and is also called by run_analysis.sh -- `README.md` - The document that has the information that will orient someone to this project, we'll discuss more about how to create a helpful README in [an upcoming chapter](https://jhudatascience.org/Reproducibility_in_Cancer_Informatics/documenting-analyses.html#readmes). 
+- `README.md` - The document that has the information that will orient someone to this project - `plots` - A folder of plots and resulting images -- `results` - A folder results -- `raw-data` - Data files as they first arrive and **nothing** has been done to them yet. -- `processed-data` - Data that has been modified from the raw in some way. +- `results` - A folder of results +- `raw-data` - Data files as they first arrive and **nothing** has been done to them yet +- `processed-data` - Data that has been modified from the raw in some way - `util` - A folder of utilities that never needs to be called or touched directly unless troubleshooting something -There are lots of ideas out there for organizational strategies. Key is finding one that fits your team and your project. You can read through some of these articles to think about what kind of organizational strategy might work for you and your team: +There are lots of ideas out there for organizational strategies. The key is finding one that fits your team and your project. You can read through some of these articles to think about what kind of organizational strategy might work for you and your team: - [Reproducible R example](https://github.com/jhudsl/reproducible-r-example) - [Jenny Bryan's organizational strategies](https://www.stat.ubc.ca/~jenny/STAT545A/block19_codeFormattingOrganization.html) [@Bryan2021]. @@ -144,13 +144,13 @@ There are lots of ideas out there for organizational strategies. Key is finding ## Navigate file paths -In point and click apps (called Graphics User Interfaces) you navigate to files by clicking on folders. But for R programming and other command line interfaces, we navigate and use files by using `file paths`. `File paths` are series of folders it takes to get to the file, not unlike a street address. 
+In point and click apps (called [Graphical User Interfaces (or GUI, pronounced like the word gooey)](https://en.wikipedia.org/wiki/Graphical_user_interface)) you navigate to files by clicking on folders. But for R programming and other command line interfaces, we navigate and use files by using `file paths`. `File paths` are the series of folders that it takes to get to a file, not unlike a street address. -To make an analogy, if someone asked you directions to a particular building, the directions you give would be tailored based on where this person located. In other words your directions would be relative to their location. +To make an analogy, if someone asked you directions to a particular building, the directions you would give would be tailored based on where the person asking is located. In other words your directions would be relative to their location. But file paths can be *relative* or *absolute*. -In the same way, your computer can be given absolute directions to a file - basically the directions with absolute directions or they can be relative to where you are calling the command in the computer. +Your computer can be given directions relative to where you are calling the command in the computer or they can be absolute directions to a file - basically the full directions to that file, regardless of where you might be already on your computer. ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! example image"} ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_1337") @@ -167,7 +167,7 @@ The end of a path string may be a file name if you are creating a path to a file
In the above analogy a person being located in Baltimore would be their working directory. In a path, folder names are separated by forward slashes `/` -Note that a relative directory may be different between different apps: RStudio versus Terminal versus something else. So you if you switch between the `Console` and `Terminal` tabs, you will have to pay attention to what your `working directory` is. This is also different from the `Files` pane which has no bearing on your working directory either. +Note that a relative directory may be different between different apps: RStudio versus Terminal versus something else. So you if you switch between the `Console` and `Terminal` tabs, you will have to pay attention to what your `working directory` is. This is also different from the `Files` pane which has no bearing on your working directory either. The terminal tab is located in the Console pane in RStudio, which is usually the lower left pane (with default settings). You can use the terminal to work with files using the command line. Returning to computer files. In your Terminal you can see your working directory at the top of the Terminal window or at the beginning of the terminal prompt. Knowing this, this can tell you how you need to change the command you are entering. Let’s say you want to list, using the `ls` command, a file called `file.txt`. @@ -193,7 +193,7 @@ http://projecttemplate.net/ ### Scientific notebooks (Rmd or qmd) -The generous use and keeping of notebooks is a useful tool for documentation of the development of an analysis. +Using notebooks can be a very helpful tool for documenting the development of an analysis. Data analyses can lead one on a winding trail of decisions and side investigations, but notebooks allow you to narrate your thought process as you travel along these analyses explorations! 
@@ -205,18 +205,25 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1LMurysUhCjZb7DVF #### The purposes of the notebook -What scientific question are you trying to answer? Describe the dataset you are using to try to answer this and why does it help answer this question? +It can be helpful to others and your future self to describe: + +- The scientific question you are trying to answer +- The dataset you are using to try to answer this question +- An explanation for the choice of the dataset to help answer this question #### The rationales behind your decisions -Describe why a particular code chunk is doing a particular thing -- the more odd the code looks, the greater need for you to describe why you are doing it. +Describe major code decisions. For example, why you chose to use specific packages or why you took certain steps in that specific order. This can be very general to very specific, such as why a particular code chunk is doing a particular thing. The more possible options there were for choices or the more unusual a process that you might have taken, the greater the need to describe why you made certain decisions. -Describe any particular filters or cutoffs you are using and how did you decide on those? +Describe any particular filters or cutoffs you are using and how you decided on those. -For data wrangling steps, why are you wrangling the data in such a way -- is this because a certain package you are using requires it? +For data wrangling steps, describe why you are wrangling the data in such a way. Is this because a certain package you are using requires it? #### Your observations of the results -What do you think about the results? The plots and tables you show in the notebook -- how do they inform your original questions? +In this section it is helpful to include: + +- What do you currently think about the results?
+- What do you think about the plots and tables you show in the notebook -- how do they inform your original questions? -There are two major types of notebooks folks use in the R programming language: R Markdown files and Quarto files. In the next section we will discuss these notebooks, how they are the same, how they are different, and how to use them. +There are two major types of notebooks folks use in the R programming language: R Markdown files and Quarto files. In the next section we will discuss these notebooks, the similarities and differences between these two options, and how to use them. diff --git a/06-rmarkdown.Rmd b/06-rmarkdown.Rmd index 48bf055..9248d0c 100644 --- a/06-rmarkdown.Rmd +++ b/06-rmarkdown.Rmd @@ -8,15 +8,15 @@ ottrpal::set_knitr_image_path() ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g20eecbcf66d_84_0") ``` -## Reports support reproducibility +## Notebook reports support reproducibility -Using notebooks help you to create supports that can more transparently show what you did for your analysis and it can help you to test that your code works as expected. Scripts allow you to save code, but they do not allow you to have the following additional benefits. +Using notebooks can help you more transparently show what you did for your analysis. They can also help you to test that your code works as expected. Scripts allow you to save code, but they do not allow you to have the following additional benefits. -The following are reasons why R Markdown files help reproducibility: +The following are reasons why notebooks help reproducibility: -- They allow you to show and share your code and the output of your code in one place! (this can be done in several ways depending on what you want) +- They allow you to show and share your code and the output of your code in one place! (This can be done in several ways depending on what you want.) 
- They allow you to test if your code works outside of what is active in your environment -- They allow you to test sections and all previous sections of your code out to troubleshoot +- They allow you to test sections and all previous sections of your code, which can help with troubleshooting - They help you understand what might be wrong with your code in smaller sections of code if you have an issue ## R Markdown or Quarto? From f3c9ad0ca00d227479db5ca306c25b4ef69353ca Mon Sep 17 00:00:00 2001 From: carriewright11 Date: Wed, 18 Dec 2024 14:21:07 -0500 Subject: [PATCH 7/8] fixing spelling --- 05-setting-up.Rmd | 2 +- resources/dictionary.txt | 4 ++++ 2 files changed, 5 insertions(+), 1 deletion(-) diff --git a/05-setting-up.Rmd b/05-setting-up.Rmd index d4e33e7..ac815b2 100644 --- a/05-setting-up.Rmd +++ b/05-setting-up.Rmd @@ -155,7 +155,7 @@ Your computer can be given directions relative to where you are calling the comm ```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! 
example image"} ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_1337") ``` -So in our above analogy, if you are trying to direct someone to somewhere on Johns Hopkins’ campus with a file path: +So in our above analogy, if you are trying to direct someone to somewhere on the Johns Hopkins campus with a file path: An absolute file path would be: `/Earth/North America/United States/Maryland/Baltimore/Johns Hopkins University/Street Name/Building number` diff --git a/resources/dictionary.txt b/resources/dictionary.txt index b985316..5e9d72b 100644 --- a/resources/dictionary.txt +++ b/resources/dictionary.txt @@ -75,14 +75,18 @@ ITN fyi Leanpub Markua +md mentorship +Navendu NCI NHGRI NIGMS +Nyakundi ottrpal Pandoc PBC PII +Pottekkat ProjectTemplate reproducibility RStudio From 75eff5e67fe709415ec4a8537c875a78fb4b87d2 Mon Sep 17 00:00:00 2001 From: Candace Savonen Date: Thu, 19 Dec 2024 10:29:53 -0500 Subject: [PATCH 8/8] Apply suggestions from code review Co-authored-by: Carrie Wright <23014755+carriewright11@users.noreply.github.com> --- 05-setting-up.Rmd | 32 +++++++++++++++++++++++++++----- 06-rmarkdown.Rmd | 12 ++++++------ 2 files changed, 33 insertions(+), 11 deletions(-) diff --git a/05-setting-up.Rmd b/05-setting-up.Rmd index ac815b2..8eee926 100644 --- a/05-setting-up.Rmd +++ b/05-setting-up.Rmd @@ -65,7 +65,7 @@ READMEs stick out in a project and are generally universal signal for new people 3) Lists of any software required by the project 4) Input and output file descriptions 5) Descriptions of any additional tools included in the project - +6) License for how your materials should be used You can take a look at this [template README](https://raw.githubusercontent.com/jhudsl/Reproducibility_in_Cancer_Informatics/main/resources/README-template.md) to get your started. 
#### More about writing READMEs: @@ -79,6 +79,17 @@ You can take a look at this [template README](https://raw.githubusercontent.com/ - https://github.com/lcolladotor/derfinder - https://github.com/tidyverse/dplyr +#### Licensing + +Adding information about a license is not always required, but it can be a good idea. If you put your code on GitHub, then the default copyright laws apply. According to GitHub: + +> "You retain all rights to your source code and no one may reproduce, distribute, or create derivative works from your work. If you're creating an open source project, we strongly encourage you to include an open source license." + +::: dictionary +Open source software or code means that it is distributed with a license that allows others to reuse or adapt your code for other purposes. This is very helpful to advance science and technology. +::: + +Check out this great resource on [options for licenses](https://choosealicense.com/) to help you choose which license is right for your project. ### Example organization scheme Getting more specific, here's some ideas of how to organize your project: @@ -89,7 +100,7 @@ Getting more specific, here's some ideas of how to organize your project: - **Put source scripts and functions in their own directory**. Things that should never need to be called directly by yourself or anyone else. - **Put output in its own directories** like `results` and `plots`. - **Have a central document (like a README)** that describes the basic information about the analysis and how to re-run it. -- Make it easy on yourself, **dates aren't necessary**. The computer keeps track of those. +- Make it easy on yourself, **dates aren't necessary** to track for file updates. The computer keeps track of when a file was updated. - **Make a central script that re-runs everything** -- including the creation of the folders! (more on this in a later chapter) Let's see what these principles might look in practice. 
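As a rough sketch of the organization principles above, the following shell commands scaffold a layout like the chapter's hypothetical `project-name/` example. The folder and script names come from that example. Placing the scripts at the top level and scaffolding with `mkdir -p` and `touch` are assumptions, not the only reasonable choice.

```shell
#!/bin/sh
# Sketch: scaffold the hypothetical project layout described in this chapter.
project="project-name"   # placeholder name (an assumption); use your own

# Keep raw data, processed data, results, plots, and utilities in separate folders.
mkdir -p "$project/raw-data" "$project/processed-data" \
         "$project/results" "$project/plots" "$project/util"

# Numbered scripts show run order; the central script and README sit at the top level.
touch "$project/README.md" \
      "$project/run_analysis.sh" \
      "$project/00-download-data.sh" \
      "$project/01-make-heatmap.Rmd"

# List what was created.
ls "$project"
```

Because `mkdir -p` does not fail when a folder already exists, commands like these can live inside a central script that is safe to re-run.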
@@ -152,7 +163,7 @@ But file paths can be *relative* or *absolute*. Your computer can be given directions relative to where you are calling the command in the computer or they can be absolute directions to a file - basically the full directions to that file, regardless of where you might be already on your computer. -```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Major point!! example image"} +```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "A relative path might be from the local neighborhood to Johns Hopkins, whereas an absolute path is analogous to a path that could direct you from further away, so state information would also be included"} ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g2fea8805c08_0_1337") ``` So in our above analogy, if you are trying to direct someone to somewhere on the Johns Hopkins campus with a file path: @@ -169,9 +180,20 @@ To know your location within a file system is to know exactly what folder you ar Note that a relative directory may be different between different apps: RStudio versus Terminal versus something else. So you if you switch between the `Console` and `Terminal` tabs, you will have to pay attention to what your `working directory` is. This is also different from the `Files` pane which has no bearing on your working directory either. The terminal tab is located in the Console pane in RStudio, which is usually the lower left pane (with default settings). You can use the terminal to work with files using the command line. -Returning to computer files. In your Terminal you can see your working directory at the top of the Terminal window or at the beginning of the terminal prompt. Knowing this, this can tell you how you need to change the command you are entering. Let’s say you want to list, using the `ls` command, a file called `file.txt`.
+Returning to computer files, we can have relative or absolute paths based on where we are on the computer. If we are looking for a file in a directory that is on the desktop, then we can use a path from the desktop that is shorter than the absolute path, which identifies where the file is within the file system as a whole. + +```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "Relative path on a computer might be from the desktop to a file in a directory called work and would simply be work/file.txt, while an absolute path would be the full path to the directory you might want to work with such as Users/reproducibilityparrot/desktop/work/file.txt"} ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g31fc7298e99_0_0")``` + + +In your Terminal you can see your working directory at the top of the Terminal window or at the beginning of the terminal prompt. Knowing your working directory tells you how you need to write the command you are entering. Let’s say you want to list, using the `ls` command, a file called `file.txt`. + + + + +```{r, fig.align='center', out.width="100%", echo = FALSE, fig.alt= "We can use the terminal tab to see our files or the files pane tab. We can list files in the Terminal tab with the command ls. We can also navigate around within the file pane to see files."} ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g31fc7298e99_0_40")``` + + -Alternatively, you can specify an absolute path. An absolute path starts at the root directory of a file system. The root directory does not have a name like other folders do. It is specified with a single forward slash `/` and is special in that it cannot be contained within other folders.
In the current file system, the root directory contains a folder called cloud, which contains a folder called project which contains the folder `data_analysis_project`, which has the file.txt file we are looking for. +An absolute path starts at the root directory of a file system. The root directory does not have a name like other folders do. It is specified with a single forward slash `/` and is special in that it cannot be contained within other folders. ## Handy R Tools diff --git a/06-rmarkdown.Rmd b/06-rmarkdown.Rmd index 9248d0c..f320cef 100644 --- a/06-rmarkdown.Rmd +++ b/06-rmarkdown.Rmd @@ -27,7 +27,7 @@ Both Qmd and Rmd files are both notebooks that have the benefits we've described R Markdown was the first R programming notebook on the scene, and has a lot of tools devoted to it because it has been around awhile. In 2022, [Posit released the Quarto notebook](https://posit.co/blog/announcing-quarto-a-new-scientific-and-technical-publishing-system/). So Quarto has a lot of great new features but is still relatively new. -Posit created Quarto with the idea of streamlining document making and allowing for more compatibility with languages that are not R. While R Markdown documents also (sort of) allow for other languages to be run in them, their ability to do this successfully is limited. +Posit created Quarto with the idea of streamlining document making by allowing for more compatibility with languages beyond R. While R Markdown documents also somewhat allow for other languages, their ability to do this successfully is limited. ### R Markdown Pros: - Time tested, a lot of packages and resources built for it. @@ -42,8 +42,8 @@ Posit created Quarto with the idea of streamlining document making and allowing - Appears to be more streamlined/centralized and less need for a lot of extra packages to create other types of documents. ### Quarto Cons: -- It is still quite new, and the community is still catching up to it. 
Though it appears to be built with backwards compatibility in mind. -- Because it so new, there are still features they are continuing to build for it that R Markdown has that it may not just yet. At this point, these are mostly very fringe features, but it could be something an individual user is interested in. +- It is still quite new, and the community is still catching up to it, although it appears to be built with backwards compatibility in mind. +- Because it is so new, there are still some features that are being developed for Quarto that R Markdown already supports. At this point, these are mostly features that would allow for customization. ## Getting Started with notebooks @@ -159,7 +159,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_ ``` -You can modify the `"Untitled"` text after `title:` to specify the title of your report. If you want to change the author section where it says `"your name"` in the example if that was not by default want you wanted in your file. +You can modify the `"Untitled"` text after `title:` to specify the title of your report. If you want to, you can also change the author section where it says `"your name"` in the example. ### Code chunks @@ -171,7 +171,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_ The notation here means the following: -- The three back ticks `"```"` indicate the barriers of where some code will be in between. This is what we call an R chunk. +- The three back ticks `"```"` mark the boundaries of where code should be placed. This is what we call a code chunk. - The `{r}` indicates that we are going to write the code using R code. - Extra information can be added inside the curly bracket `{}` notation to give the chunk a name, in this case it is called `setup`. - The `include = FALSE` means that it will not show up in the rendered report. @@ -227,7 +227,7 @@ You may also notice that there is another button to the left of the play button.
ottrpal::include_slide("https://docs.google.com/presentation/d/1MNHf8JpolaEP_vQ_kB-1xRBF9wo3haCArRu117hBoHA/edit#slide=id.g229ab7a949e_0_297") ``` -This is super helpful for reproducibility in terms of making sure that your code isn't working simply because you have something in your environment from code that you tested in the console but did not save. +This is super helpful for reproducibility in terms of making sure that your code is working properly with all the necessary pieces. Sometimes code just works during an R session (and not after) simply because it is relying on an object or code currently in our environment that is not saved in our notebook. For example, code that was tested in the console but not saved will not be run the next time we try to knit our R Markdown file. Issues can happen if you run a code chunk out of order or change the code in a chunk after running it previously. This can make you think that you have all the code that you need saved to obtain the result that you found, when in fact you do not.