Session

1 Initialise R environment

I run this at the start of every session as a base.

n.b. Additional packages are imported in as needed.

##### Set libPaths and memory/parallel cores usage #####
.libPaths(c("/home/chongmorrison/R/x86_64-pc-linux-gnu-library/4.3",
            "/home/chongmorrison/R/4.4.1-Bioc3.19"))
.libPaths() # check .libPaths

options(Ncpus = 12) # adjust no. of cores for base R
getOption("Ncpus", 1L)

library(future)
options(future.globals.maxSize = 2000 * 1024^2) # adjust limit of allowable object size = 2G
# *could* enable parallelisation i.e. workers > 1, for Seurat etc. but breaks Python processes currently...
future::plan("multisession", workers = 1) # or use "sequential" mode
future::plan()

library(BiocParallel) # for parallelising Bioconductor packages

##### Import analysis packages #####
# Potential bug: Python environment needs to be activated before loading Seurat
library(reticulate)
conda_list() # check available conda environments
use_condaenv("singlecell-scHPF", required=TRUE) # has leidenalg etc. installed
# R packages
library(Seurat)
library(tidyverse)
library(dittoSeq)
library(SingleCellExperiment)
library(grateful)

# set seed for reproducibility of random sampling
set.seed(584)

2 Ensembl annotations

Useful to have gene annotations in CSV format at hand, which can be called into the analysis anytime.

library(AnnotationHub)
library(ensembldb)
# Connect to AnnotationHub
ah <- AnnotationHub()
# Access the Ensembl database for organism
ahDb <- query(ah, 
              pattern = c("Danio rerio", "EnsDb"), 
              ignore.case = TRUE)
# Acquire the latest annotation files
id <- ahDb %>%
  mcols() %>%
  rownames() %>%
  tail(n = 1)
# Download the appropriate Ensembldb database
edb <- ah[[id]]
# Extract gene-level information from database
annotations <- genes(edb, 
                     return.type = "data.frame")
# Select annotations of interest
annotations <- annotations %>%
  dplyr::select(gene_id, gene_name, seq_name, gene_biotype, description)
# Save for later use
write.csv(annotations, file="./annotations/ensembl_annotations.csv")

3 Other databases

4 Zebrafish Information Network (ZFIN)

4.1 Gene expression data

ZFIN has an excellent curation of in vivo expression patterns obtained via WISH (Whole-mount in situ hybridisation). As an example, the following code retrieves WISH data for Wild Type condition from https://zfin.org/downloads (‘Gene Expression’ > ‘Expression data for wildtype fish’).

gex <- read.delim(url("https://zfin.org/downloads/wildtype-expression_fish.txt"), header = FALSE, sep ="\t")
head(gex, 5)

# Add column IDs (based on Column Headers in the Downloads page above)
colnames(gex) <- c("GeneID", "GeneSymbol","FishName","SuperStructureID","SuperStructureName",
                   "SubStructureID","SubStructureName","StartStage","EndStage","Assay",
                   "AssayMMOID","PublicationID","ProbeID","AntibodyID","FishID")

Here, the data is filtered to only include information-of-interest e.g. Structure i.e. anatomical information.

gex_ISH <- gex[which(gex$FishName=='WT' | gex$FishName=='AB/TU'), ]
gex_ISH <- gex_ISH[which(gex_ISH$Assay=='mRNA in situ hybridization'), ]
gex_ISH <- data.frame(gex_ISH$GeneSymbol, gex_ISH$SuperStructureName)
colnames(gex_ISH) <- c("GeneSymbol","Structure")
head(gex_ISH, 10)

4.2 Human orthologue information

ZFIN_human <- read.delim(url("https://zfin.org/downloads/human_orthos.txt"), header = FALSE, sep ="\t")
head(ZFIN_human, 5)
# retrieve fish and human orthologues
ZFIN_human <- unique(data.frame(ZFIN_human$V2, ZFIN_human$V4))
# Add column IDs
colnames(ZFIN_human) <- c("zf_gene","human_gene")

5 References

  • Mary Piper, Meeta Mistry, Jihe Liu, William Gammerdinger, & Radhika Khetani. (2022, January 6). hbctraining/scRNA-seq_online: scRNA-seq Lessons from HCBC (first release). Zenodo. https://doi.org/10.5281/zenodo.5826256.
  • Peter W Harrison, M Ridwan Amode, Olanrewaju Austine-Orimoloye, Andrey G Azov, Matthieu Barba, If Barnes, Arne Becker, Ruth Bennett, Andrew Berry, Jyothish Bhai, Simarpreet Kaur Bhurji, Sanjay Boddu, Paulo R Branco Lins, Lucy Brooks, Shashank Budhanuru Ramaraju, Lahcen I Campbell, Manuel Carbajo Martinez, Mehrnaz Charkhchi, Kapeel Chougule, Alexander Cockburn, Claire Davidson, Nishadi H De Silva, Kamalkumar Dodiya, Sarah Donaldson, Bilal El Houdaigui, Tamara El Naboulsi, Reham Fatima, Carlos Garcia Giron, Thiago Genez, Dionysios Grigoriadis, Gurpreet S Ghattaoraya, Jose Gonzalez Martinez, Tatiana A Gurbich, Matthew Hardy, Zoe Hollis, Thibaut Hourlier, Toby Hunt, Mike Kay, Vinay Kaykala, Tuan Le, Diana Lemos, Disha Lodha, Diego Marques-Coelho, Gareth Maslen, Gabriela Alejandra Merino, Louisse Paola Mirabueno, Aleena Mushtaq, Syed Nakib Hossain, Denye N Ogeh, Manoj Pandian Sakthivel, Anne Parker, Malcolm Perry, Ivana Piližota, Daniel Poppleton, Irina Prosovetskaia, Shriya Raj, José G Pérez-Silva, Ahamed Imran Abdul Salam, Shradha Saraf, Nuno Saraiva-Agostinho, Dan Sheppard, Swati Sinha, Botond Sipos, Vasily Sitnik, William Stark, Emily Steed, Marie-Marthe Suner, Likhitha Surapaneni, Kyösti Sutinen, Francesca Floriana Tricomi, David Urbina-Gómez, Andres Veidenberg, Thomas A Walsh, Doreen Ware, Elizabeth Wass, Natalie L Willhoft, Jamie Allen, Jorge Alvarez-Jarreta, Marc Chakiachvili, Bethany Flint, Stefano Giorgetti, Leanne Haggerty, Garth R Ilsley, Jon Keatley, Jane E Loveland, Benjamin Moore, Jonathan M Mudge, Guy Naamati, John Tate, Stephen J Trevanion, Andrea Winterbottom, Adam Frankish, Sarah E Hunt, Fiona Cunningham, Sarah Dyer, Robert D Finn, Fergal J Martin, and Andrew D Yates. (Ensembl 2024). Nucleic Acids Res. 2024, 52(D1):D891–D899. 10.1093/nar/gkad1049
  • Bradford, Y.M., Van Slyke, C.E., Ruzicka, L., Singer, A., Eagle, A., Fashena, D., Howe, D.G., Frazer, K., Martin, R., Paddock, H., Pich, C., Ramachandran, S., Westerfield, M. (2022) Zebrafish Information Network, the knowledgebase for Danio rerio research. Genetics. 220(4). 10.1093/genetics/iyac016

6 Session Info

sessionInfo()
R version 4.4.2 (2024-10-31)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_4.4.2    fastmap_1.2.0     cli_3.6.3         tools_4.4.2      
 [5] htmltools_0.5.8.1 yaml_2.3.10       rmarkdown_2.29    knitr_1.49       
 [9] jsonlite_1.8.9    xfun_0.49         digest_0.6.37     rlang_1.1.4      
[13] evaluate_1.0.1