An open-source multi-semantic annotation dataset and automated recognition tool for viral carcinogenesis factors

  • Honglian Huang
  • Danqi Huang
  • Ziyi Wei
  • Yanling Qi
  • M. James C. Crabbe
  • Xiaoyan Zhang

  • Tongji University
  • Shanghai University of Traditional Chinese Medicine
  • University of Shanghai for Science and Technology
Research Output: Contribution to journal Article Peer-review

Open access

Sustainable Development Goals

  • SDG 3 - Good Health and Well-being

Abstract

In-depth investigations into the characteristics of high-risk oncogenic viruses are critical for the early prevention and control of related cancers and the development of effective vaccines. The mechanism of viral carcinogenesis involves numerous risk factors such as viral genomic variations, lifestyle, and environmental influences. Based on literature data on eight oncogenic viruses, we have created a large-scale, semantically rich corpus of viral carcinogenic factors, including 551,715 abstracts and 5,821,308 entities, using natural language processing technology combined with expert knowledge. We also developed a semantic filter to improve entity recognition performance. Moreover, transcriptomic data related to oncogenic viruses were collected. We performed gene differential expression analysis, feature gene identification, and immune microenvironment analysis. A visual knowledge platform, an open-source dataset, and a tool for automatically identifying internal and external semantic factors related to viral carcinogenesis are available at http://www.biomedinfo.cn:8281/. This study provides new insights into the key factors involved in the viral carcinogenesis process and helps researchers and clinicians quickly obtain clues for further experimental research and clinical validation.

Publication Information

Output type

Research Output: Contribution to journal Article Peer-review

Original language

English

Article number

baaf038

Journal (Volume, Issue Number)

Database: The Journal of Biological Databases and Curation (Volume 2025)

Publication milestones

  • Accepted/In press - 14/05/2025
  • Published - 24/09/2025

Publication status

Published - 24/09/2025

ISSN

1758-0463

External Publication IDs

  • Scopus: 105016908763
  • PubMed: 40996719