An open-source multi-semantic annotation dataset and automated recognition tool for viral carcinogenesis factors
- Honglian Huang
- Danqi Huang
- Ziyi Wei
- Yanling Qi
- M. James C. Crabbe
- Xiaoyan Zhang
- Tongji University
- Shanghai University of Traditional Chinese Medicine
- University of Shanghai for Science and Technology
Open access
Sustainable Development Goals
- SDG 3 Good Health and Well-being
Abstract
In-depth investigations into the characteristics of high-risk oncogenic viruses are critical for the early prevention and control of related cancers and the development of effective vaccines. The mechanism of viral carcinogenesis involves numerous risk factors such as viral genomic variations, lifestyle, and environmental influences. Based on literature data on eight oncogenic viruses, we have created a large-scale, semantically rich corpus of viral carcinogenic factors, including 551,715 abstracts and 5,821,308 entities, using natural language processing technology combined with expert knowledge. We also developed a semantic filter to improve entity recognition performance. Moreover, transcriptomic data related to oncogenic viruses were collected. We performed differential gene expression analysis, feature gene identification, and immune microenvironment analysis. A visual knowledge platform, an open-source dataset, and a tool for automatically identifying internal and external semantic factors related to viral carcinogenesis are available at http://www.biomedinfo.cn:8281/. This study provides new insights into the key factors involved in the viral carcinogenesis process and helps researchers and clinicians quickly obtain clues for further experimental research and clinical validation.
Publication Information
Output type
Original language
English
Article number
baaf038
Journal (Volume, Issue Number)
Database: the journal of biological databases and curation (Volume 2025)
Publication milestones
- Accepted/In press - 14/05/2025
- Published - 24/09/2025
Publication status
ISSN
1758-0463
External Publication IDs
- Scopus: 105016908763
- PubMed: 40996719
