wcGeneSummary: Text Mining And Annotating Gene Cluster


Author(s): Noriaki Sato

Affiliation(s): Kyoto University



The functional annotation of gene lists identified by gene clustering or differential expression analysis is a central focus of bioinformatics analysis. Enrichment analysis can address this problem using curated biological pathway databases; however, text mining approaches have the potential to annotate a gene list at higher resolution or to reveal previously unknown mechanisms not listed in those databases. We therefore developed the package wcGeneSummary, which text-mines the RefSeq summary data for a gene list and generates word clouds and co-occurrence networks of words using R packages including GeneSummary, ggraph, wordcloud, and tm. We illustrate examples including annotating gene clusters identified by weighted gene coexpression network analysis (WGCNA) that have no statistically significant pathways in enrichment analysis, annotating a dendrogram of module eigengenes from WGCNA, and constructing a co-occurrence network of words in biological pathways from the databases. The results suggest that word-based visualization and visual inspection of a dendrogram annotated with text information are useful for interpreting gene clusters in analyses such as WGCNA. The package is available at https://github.com/noriakis/wcGeneSummary.
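The general idea behind the word counts and co-occurrence networks described above can be sketched in a few lines. This is only an illustrative sketch in Python using the standard library; it is not the wcGeneSummary implementation or its API. The gene names, summary texts, stop-word list, and function names are all hypothetical placeholders rather than real RefSeq data.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical placeholder texts standing in for RefSeq gene summaries
# (not real database entries).
summaries = {
    "GENE_A": "This gene encodes a kinase involved in DNA repair.",
    "GENE_B": "This gene encodes a kinase that regulates the cell cycle.",
    "GENE_C": "The encoded protein participates in DNA repair and cell cycle control.",
}

# Assumed stop-word list; a real analysis would use a curated one.
STOPWORDS = {
    "this", "the", "a", "an", "in", "that", "and", "of",
    "gene", "encodes", "encoded", "protein",
}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def word_frequencies(texts):
    """Count how often each word occurs across all summaries (word cloud input)."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts


def cooccurrence(texts):
    """Count, for each word pair, the number of summaries containing both words."""
    pairs = Counter()
    for text in texts:
        unique_words = sorted(set(tokenize(text)))
        pairs.update(combinations(unique_words, 2))
    return pairs


freq = word_frequencies(summaries.values())
edges = cooccurrence(summaries.values())

# Most frequent words would be drawn largest in a word cloud;
# the pair counts would serve as edge weights in a co-occurrence network.
print(freq.most_common(5))
print(edges.most_common(5))
```

In this sketch, the word frequencies correspond to the sizes of words in a word cloud and the pair counts to the edge weights of a co-occurrence network.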