Akey checksum calculator
5/21/2023

Correct annotation metadata is critical for reproducible and accurate RNA-seq analysis. When files are shared publicly or among collaborators with incorrect or missing annotation metadata, it becomes difficult or impossible to reproduce bioinformatic analyses from raw data. Missing metadata also makes it more difficult to locate the transcriptomic features, such as transcripts or genes, in their proper genomic context, which is necessary for overlapping expression data with other datasets.

We provide a solution in the form of an R/Bioconductor package, tximeta, that performs numerous annotation and metadata gathering tasks automatically on behalf of users during the import of transcript quantification files. The correct reference transcriptome is identified via a hashed checksum stored in the quantification output, and key transcript databases are downloaded and cached locally. The computational paradigm of automatically adding annotation metadata based on reference sequence checksums can greatly facilitate genomic workflows by reducing overhead during bioinformatic analyses, preventing costly bioinformatic mistakes, and promoting computational reproducibility.

Gene expression quantification from RNA sequencing is a common component of many research publications. For research findings to be computationally reproducible, it is critical that gene expression datasets are linked to the correct gene annotation, including the source of the annotation, the release number, and the location of the genes in a particular genome assembly. This critical metadata is often difficult to find for public datasets, and curating it manually exposes the process to human error. We describe a solution for the missing metadata problem, whereby we embed a checksum of the RNA reference sequences in the output files during the expression quantification step.
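The checksum idea described above can be illustrated with a small Python sketch. This is not tximeta's code and does not reproduce how any particular quantification tool computes its digest. It assumes a SHA-256 hash over the transcript sequences, in file order, with a newline after each sequence. The function names, the lookup table, and the `reference_checksum` metadata key are all hypothetical and exist only for illustration.

```python
import hashlib


def read_fasta_sequences(lines):
    """Yield each sequence from FASTA-formatted lines, ignoring header lines."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def reference_checksum(sequences):
    """Return a SHA-256 hex digest over the sequences, in the order given.

    Assumption for this sketch: sequences are upper-cased and each one is
    followed by a newline so that transcript boundaries affect the digest.
    """
    digest = hashlib.sha256()
    for seq in sequences:
        digest.update(seq.upper().encode("ascii"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical toy transcriptome; a real one would be read from a FASTA file.
toy_fasta = [">tx1", "ACGTAC", "GTTA", ">tx2", "GGCCA"]
toy_checksum = reference_checksum(read_fasta_sequences(toy_fasta))

# Hypothetical lookup table mapping checksums to annotation metadata.
KNOWN_TRANSCRIPTOMES = {
    toy_checksum: {"source": "ExampleDB", "release": "1", "genome": "toyGenome1"},
}


def identify_transcriptome(quant_metadata, table=KNOWN_TRANSCRIPTOMES):
    """Look up annotation metadata from the checksum stored with the output.

    Returns None when the checksum is missing or not in the table.
    """
    return table.get(quant_metadata.get("reference_checksum"))


# The quantification step would store the checksum alongside its results.
quant_metadata = {"reference_checksum": toy_checksum}
print(identify_transcriptome(quant_metadata))
```

Because the digest depends only on the sequences themselves, the same reference yields the same checksum however the FASTA file is named or line-wrapped. That property is what allows the annotation source, release, and genome assembly to be recovered later without manual record keeping.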