Read and validate a table with genes (that should be tested in overrepresentation analysis) for compatibility with this R package

If 'pvalue' is not among the genelist columns, it is added and set to 1 for visualization purposes. If 'effectsize' is not among the genelist columns, it is added and set to 0 for visualization purposes.
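The default filling described above can be pictured with a few lines of base R. This is a minimal sketch on a made-up two-row table, not the package's internal code; the `genelist` data frame and its values are hypothetical.

```r
# Hypothetical gene list that lacks both optional columns.
genelist <- data.frame(
  symbol = c("gene_1", "gene_2"),
  gene   = c(101L, 102L),
  stringsAsFactors = FALSE
)

# Add 'pvalue' with a default of 1 when the column is missing.
if (!"pvalue" %in% colnames(genelist)) {
  genelist$pvalue <- 1
}

# Add 'effectsize' with a default of 0 when the column is missing.
if (!"effectsize" %in% colnames(genelist)) {
  genelist$effectsize <- 0
}

genelist
```

Columns that are already present are left untouched, so only genuinely missing values receive the neutral defaults.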

read_validate_genelist(
  file,
  remove_non_numerical_ids = TRUE,
  remove_duplicated = TRUE,
  remove_Rik_genes = TRUE,
  remove_Gm_genes = TRUE,
  map_organism = NULL
)

Arguments

file

full file path to the gene table in .csv, .xlsx, or .tsv format

remove_non_numerical_ids

boolean, default TRUE, removes rows whose value in the gene column is non-numerical

remove_duplicated

boolean, default TRUE, removes duplicated gene symbols/ids

remove_Rik_genes

boolean, default TRUE, removes RIKEN non-canonical mouse genes, matched with grepl("Rik$")

remove_Gm_genes

boolean, default TRUE, removes Gm non-canonical mouse genes, matched with grepl("^Gm")

map_organism

default: NULL. If a numeric taxid is given, it is used to select the org.Xx.eg.db package for mapping gene symbols to the gene column via AnnotationDbi::mapIds(keytype = 'ALIAS'). Genes that map to NA are removed. The org.Xx.eg.db package must be downloaded manually. Symbols are converted with toupper() to match formatting. Protein symbols can be used too. Supported taxids:

9606 = Human (Homo sapiens) (org.Hs.eg.db)
9544 = Rhesus monkey (Macaca mulatta) (org.Mmu.eg.db)
10090 = Mouse (Mus musculus) (org.Mm.eg.db)
10116 = Rat (Rattus norvegicus) (org.Rn.eg.db)
7227 = Fruit fly (Drosophila melanogaster) (org.Dm.eg.db)
6239 = Worm (Caenorhabditis elegans) (org.Ce.eg.db)
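The symbol-to-ID mapping described for map_organism can be approximated as follows. This is a hedged sketch, not the package's implementation: the `taxid_to_orgdb` lookup, the toy `genelist`, and the choice of `column = "ENTREZID"` are assumptions made for illustration. The mapping step only runs if AnnotationDbi and the matching org.Xx.eg.db package are installed.

```r
# Lookup table built from the taxid list above (the name is hypothetical).
taxid_to_orgdb <- c(
  "9606"  = "org.Hs.eg.db",
  "9544"  = "org.Mmu.eg.db",
  "10090" = "org.Mm.eg.db",
  "10116" = "org.Rn.eg.db",
  "7227"  = "org.Dm.eg.db",
  "6239"  = "org.Ce.eg.db"
)

# Hypothetical input: symbols only, one of which is not a real gene.
genelist <- data.frame(
  symbol = c("tp53", "Gapdh", "not_a_gene"),
  stringsAsFactors = FALSE
)

map_organism <- 9606
orgdb_name <- taxid_to_orgdb[[as.character(map_organism)]]

if (requireNamespace("AnnotationDbi", quietly = TRUE) &&
    requireNamespace(orgdb_name, quietly = TRUE)) {
  # The annotation package exports an object with the same name as the package.
  orgdb <- getExportedValue(orgdb_name, orgdb_name)

  # Map upper-cased symbols to gene IDs; unmatched keys come back as NA.
  genelist$gene <- unname(AnnotationDbi::mapIds(
    orgdb,
    keys    = toupper(genelist$symbol),
    column  = "ENTREZID",  # assumption: integer-like gene IDs are Entrez IDs
    keytype = "ALIAS"
  ))

  # Drop genes that could not be mapped.
  genelist <- genelist[!is.na(genelist$gene), ]
} else {
  message("Install AnnotationDbi and ", orgdb_name, " to run the mapping step.")
}
```

Because keytype = "ALIAS" accepts alternative names as well as official symbols, one alias may match more than one gene; mapIds() returns the first match by default.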

Value

tibble with columns: symbol (string), gene (string as integer ID), pvalue (numeric), effectsize (numeric)

Examples

file_path <- system.file("extdata", "example_genelist.csv", package = "goatea")
read_validate_genelist(file = file_path)
#> Checking file format...
#> # A tibble: 100 × 5
#>    symbol   gene pvalue effectsize signif
#>    <chr>   <int>  <dbl>      <dbl> <lgl> 
#>  1 gene_29 14969 0.0783       -4.9 FALSE 
#>  2 gene_75 12070 0.845        -4.9 FALSE 
#>  3 gene_76 10534 0.0121        4.8 TRUE  
#>  4 gene_12 15042 0.282         4.8 FALSE 
#>  5 gene_68 14654 0.0203       -4.8 TRUE  
#>  6 gene_98 12937 0.791        -4.8 FALSE 
#>  7 gene_13 17523 0.201         4.7 FALSE 
#>  8 gene_35 15743 0.471         4.7 FALSE 
#>  9 gene_69 19501 0.0133       -4.7 TRUE  
#> 10 gene_65 19908 0.306        -4.6 FALSE 
#> # ℹ 90 more rows
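
The four remove_* arguments documented under Arguments can be illustrated with base R on a small made-up table. This is only a sketch: the toy data, the order of the steps, and the choice to detect duplicates on the symbol column are assumptions, not the package's confirmed behaviour.

```r
# Hypothetical gene list; symbols and IDs are made up for illustration.
genelist <- data.frame(
  symbol = c("Actb", "Gm12345", "1700001A01Rik", "Gapdh", "Gapdh", "Tp53"),
  gene   = c("11461", "100038", "66666", "14433", "14433", "not_an_id"),
  stringsAsFactors = FALSE
)

# remove_non_numerical_ids: keep rows whose gene ID consists of digits only.
genelist <- genelist[grepl("^[0-9]+$", genelist$gene), ]

# remove_duplicated: keep the first occurrence of each symbol.
genelist <- genelist[!duplicated(genelist$symbol), ]

# remove_Rik_genes: drop symbols ending in "Rik".
genelist <- genelist[!grepl("Rik$", genelist$symbol), ]

# remove_Gm_genes: drop symbols starting with "Gm".
genelist <- genelist[!grepl("^Gm", genelist$symbol), ]

genelist$symbol
#> [1] "Actb"  "Gapdh"
```

Setting any of the corresponding arguments to FALSE in read_validate_genelist() would correspond to skipping that step.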