ARTEMIS - Automated Review and Trustworthy Evaluation for Manuscripts in Science

Published Feb 27, 2025 · Peer Review

Authors and Affiliations

Jeffrey Koury, ResearchHub Foundation

Dominikus Brian, ResearchHub Foundation

Abstract

The peer review process, essential for maintaining scientific rigor, faces significant challenges including reviewer fatigue, bias, and insufficient rigor. The rapid increase in scientific publications has placed an unsustainable burden on a limited pool of reviewers, leaving reviewers unable to detect falsified data, screen for errors with high fidelity, or even establish consensus among themselves [1]. The ResearchHub Foundation aims to address these issues by 1) incentivizing reviewers and 2) integrating agentic AI systems powered by recent breakthroughs in large language and multi-modal foundation models. Our approach, which combines validated expert human reviews with trustworthy automated tools that complement human reviewers, will enhance the quality and credibility of peer review, benefiting both authors and the larger scientific community.

We have observed from our own experiments (200+ AI peer review samples) that AI reviewers excel in domains such as detecting technical errors and statistical flaws, while human reviewers provide greater insight into the broader relevance and context of research. We hypothesize that incorporating human feedback into AI peer review models can improve their performance beyond existing state-of-the-art AI systems that lack such feedback. The project has two aims:

  1. Establish a Reliable Protocol to Assess Reviewer Performance: Using a standardized rubric, human editors with specific sets of expertise will evaluate AI and human reviews across methods, results, discussion, rigor, novelty, and error detection. AI reviewers are expected to outperform in technical rigor, while humans excel in contextual analysis.
  2. Develop Versatile Agentic Solutions for Scientific Auditing: By leveraging the growing number of toolkits for deploying AI agents driven by large language models (LLMs), large multi-modal models (LMMs), long-term memory, and orchestration infrastructure, we aim to develop specialized agentic solutions honed for various aspects of scientific document auditing.

This initiative leverages ResearchHub’s extensive reviewer database, standardized prompts, and expertise-weighted evaluations to enhance the peer review process. By addressing systemic issues and integrating AI and human expertise, this project seeks to improve the accuracy, efficiency, and scalability of scientific peer review, supporting the advancement of open and validated science.

Introduction and Significance

Scientific Auditing - Peer review is a cornerstone of scientific integrity, serving as the primary mechanism for validating research and granting manuscripts the status of “verified” work. However, the process is increasingly strained by systemic challenges. Peer reviewers face difficulties in thoroughly assessing statistical analyses, verifying references, and evaluating claims against the broader scientific literature. Despite the critical nature of their role, reviewers are not compensated for their efforts, which often results in diminished rigor, personal biases, superficial feedback, and gaps in understanding key aspects of the studies they assess. These challenges are further exacerbated by the rapid growth in scientific publications, which has significantly increased the workload on an already limited pool of reviewers: the number of scientific publications increased an estimated 47% from 2016 to 2022 [2]. This combination of rising demand and insufficient support has created a critical bottleneck in the peer review process, highlighting the urgent need for innovative solutions that enhance the accuracy, efficiency, and sustainability of scientific auditing. With the rise of techniques that produce terabyte-scale datasets (e.g., single-cell RNA-seq, BARseq), both the complexity of detecting errors and the time demanded of human reviewers are amplified. This combination has also placed an excessive burden on a small number of reviewers, resulting in reviewer fatigue. ResearchCoin incentives on ResearchHub have provided an impetus to overcome reviewers' apprehension toward peer review, supplying newfound incentive and motivation. Nonetheless, even with appropriate incentives, human reviewers continue to have blind spots during the review process.

Agentic AI solution - With the rapid mass adoption of LLMs and AI agents, holistically assessing a manuscript has become feasible at high speed and low cost, providing immensely valuable insights into a paper that often reside within human reviewers' blind spots. These tools can fill gaps left by human reviewers' limitations and provide faster turnaround on feedback and requested edits, ultimately adding a more holistic set of feedback and context to the academic manuscript. Foundation models (o1, Claude, Gemini, LLaMA, DeepSeek, etc.) have made great strides and provide a basis for this approach. Even with relatively small amounts of data, such as the peer review data on ResearchHub, it is already possible to leverage these foundation models with minimal additional training data to construct a far more accurate peer review system that is greater than the sum of its parts. AI reviewers are not a panacea; it is becoming clear that human reviewers are still needed, with AI reviewers serving as a powerful supplement [3].

Hypothesis and Aims

Hypothesis: Published and preprinted papers contain errors or issues that AI reviewers can catch but human reviewers often overlook, and integrating reinforcement from human reviewers/editors can enhance the capabilities of the AI review system when domain-specific context is essential.

Aim 1: Assess the strengths and weaknesses of human reviewers and AI reviewers using a standardized rubric. Human editors will evaluate the outputs of both human reviewers and AI reviewers for methods, results, discussion, rigor, novelty, and error detection rate. Anticipated result: Human peer reviewers will be superior at interpreting implications of the methods and results that are NOT explicitly mentioned in the manuscript, while AI reviewers will be more thorough in assessing statistics, detecting errors, etc., based on information explicitly available in the manuscript.

Aim 2: Develop versatile agentic solutions driven by LLMs, LMMs, and AI agents, empowered by domain-specific knowledge and combined with human evaluation and feedback, to produce thorough evaluations of scientific documents. Anticipated result: After tuning and post-training of foundation models with human evaluation and feedback, the resulting agentic solutions will outperform the direct use of benchmark foundation models alone when applied to a particular scientific auditing task (manuscript review, editor's feedback, journal decision, etc.).
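
To make the anticipated post-training concrete, the sketch below packs a human evaluation into a tuning record; the JSONL schema and field names are our assumptions, since the actual format will depend on the chosen foundation model and tuning toolkit.

import json

# Minimal sketch (assumed schema): turn one audited manuscript, its AI review, and
# the human evaluation of that review into a post-training record.
def to_training_record(manuscript_text: str, ai_review: dict, human_eval: dict) -> str:
    record = {
        "prompt": manuscript_text,                    # input the AI reviewer saw
        "response": json.dumps(ai_review),            # review the AI reviewer produced
        "human_score": human_eval["score"],           # editor/validator rating of that review
        "human_comments": human_eval.get("comments", ""),
    }
    return json.dumps(record)

# Example usage: append one record to an assumed JSONL training file.
with open("artemis_posttraining.jsonl", "a") as f:
    f.write(to_training_record("...manuscript text...",
                               {"overall_assessment": "..."},
                               {"score": 4, "comments": "missed a boundary condition"}) + "\n")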

Methods and Design

Currently, the use of existing foundation models equipped with simple prompt engineering has provided a viable starting point for pinpointing errors and giving additional context in scientific manuscripts. We have developed a minimum viable system in the form of a web app, shown in Fig. 1 below. Throughout our pilot experiment we also observed that, holistically, using the same optimized prompt, Gemini 2.0 was the most robust foundation model at producing consistent results that human editors and readers generally consider more useful. This is due in part to its 1M+ token context window, roughly 5-7 times larger than the typical 128k and 280k context windows found in other foundation models.
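
As an illustration of this prompt-engineering starting point, the sketch below sends a manuscript and the standardized Error Detection prompt (given under Aim 1 below) to a long-context Gemini model. It assumes the google-generativeai Python SDK; the model identifier and file names are placeholders rather than the production web app's stack.

import os
import google.generativeai as genai

# Minimal sketch: one prompt-engineered review pass over a full preprint PDF.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")  # assumed model identifier

# The standardized Error Detection prompt shown under Aim 1, stored in a local file (assumed path).
error_detection_prompt = open("error_detection_prompt.txt").read()

manuscript = genai.upload_file("preprint.pdf")  # the 1M+ token context window fits entire PDFs
response = model.generate_content([manuscript, error_detection_prompt])

print(response.text)  # expected to be the valid-JSON review structure described under Aim 1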

Figure 1. Web-application developed by ResearchHub Foundation using AI agents (AI Editor and AI Peer Reviewers). 

Figure 2. Proposed ARTEMIS system architecture and workflow design.

In Fig. 2, we present the proposed ARTEMIS system architecture and workflow design. Below, we describe the key actions within each step of the process (a compressed code sketch of the happy path follows the list):

  1. Scientific Document Assigned for Decision-Making.
  2. Human users who make decisions based on scientific documents submit audit requests to gain further insights regarding the document and the scientific works it represents.
  3. Document Pre-Processing for Metadata Extraction.
  4. Depending on the metadata content, a specific full extraction protocol is performed to obtain the maximum amount of information from the document.
  5. Full extraction results are sent to the AI Quality Control agent to ensure they meet the required standards and can be used further.
  6. (6b, 7b, 8b.) Extracted document content that fails to satisfy the minimum requirements is sent to humans for manual decision-making, annotation of reasons for exclusion, and archiving within the Data Lakehouse. 
    (6a & 6a’.) Extracted documents that pass the QC requirements, either automatically or manually after human consideration, are passed on to the AI Editor.
  7. (7a.) The AI Editor generates or orchestrates the assignment of the document to the appropriate AI reviewer(s).
  8. (8a.) AI Peer Reviewer(s) generate their assessment of the document and send it to the AI QC agent to ensure correct structure and compliance with the required review standards.
  9. All reviews that pass the AI QC check are curated into a raw report ready for human evaluation.
  10. The raw report is prepared (and reformatted if necessary) to be presented to humans for Human Evaluation.
  11. Human evaluation results are submitted to the AI QC for further checks, including verifying whether the contribution is authentically human and backed by proper login credentials.
  12. Human feedback from verified editors or validators is requested to justify the quality of the given human evaluation results.
  13. All data collected thus far is curated into a coherent dataset corresponding to the scientific document and the auditing task at hand.
  14. The dataset is organized into its respective zones and/or warehouses within the Data Lakehouse.
  15. Actionable insights are produced based on the knowledge available within the Data Lakehouse to answer the questions posed in the original audit request. This process can also be further facilitated by interactive Data Librarian/Analyst Agents that help users explore or generate additional insights beyond the original request.
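
The following is a compressed, illustrative sketch of the happy path through steps 3-14 above; every function, class, and field name is a placeholder of ours, not the production ARTEMIS implementation.

from dataclasses import dataclass, field

@dataclass
class AuditRecord:
    document_id: str
    extraction: dict = field(default_factory=dict)
    reviews: list = field(default_factory=list)
    human_evaluations: list = field(default_factory=list)

def qc_accepts(payload: dict) -> bool:              # steps 5 and 8a: AI Quality Control agent
    return bool(payload)                            # stub: the real agent checks extraction/review standards

def ai_editor_assign(record: AuditRecord) -> list:  # step 7a: AI Editor assigns reviewer(s)
    return [lambda extraction: {"reviewer": "ai-neuroscience", "issues": []}]  # stub AI Peer Reviewer

def run_audit(document_id: str, extraction: dict, human_eval) -> AuditRecord:
    record = AuditRecord(document_id, extraction=extraction)
    if not qc_accepts(record.extraction):           # steps 6b-8b: route to human QC, annotate, archive
        return record
    for reviewer in ai_editor_assign(record):       # steps 6a and 7a
        review = reviewer(record.extraction)        # step 8a: AI Peer Reviewer output
        if qc_accepts(review):
            record.reviews.append(review)           # step 9: curate passing reviews into a raw report
    record.human_evaluations = human_eval(record.reviews)  # steps 10-12: human evaluation plus QC
    return record                                   # steps 13-14: dataset curated and archived downstream

# Example usage with a trivial stand-in for the human evaluation step:
run_audit("doc-001", {"text": "..."}, human_eval=lambda reviews: [{"score": 4}])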

Aim 1 - Strengths & Weaknesses of Human vs AI Reviewers

The first stage of the proposal is focused on producing the “Human Wisdom” components described across steps 10 to 13 of the ARTEMIS architecture and workflow (Fig. 2). To produce the human evaluation components of our desired dataset, we will prompt the AI reviewers and human reviewers to perform assessments based on two prompts:

  1. Classic ResearchHub Foundation prompt:
    1. ResearchHub Foundation bounty prompt (italicized):
      1. Your review must be submitted during the 14-day submission window from the day this bounty was initiated.
      2. Mention your credentials and include your areas of relevant expertise and limitations in assessing this preprint.
      3. Include the version of the preprint you are reviewing (e.g. 1st, 2nd, etc.). Make sure you review the most updated version available.
      4. Use the rating system in the "Peer Reviews" tab for all 5 criteria (overall assessment, introduction, methods, results, discussion) but the content within each is flexible (in-line comments can be used instead of a block of text in each section).
      5. For each figure, including supplementary material, you must provide a detailed assessment (pros or cons).
      6. Plagiarism will not be tolerated. You must comprehensively disclose any use of artificial intelligence (AI) platforms used in the review process. Please refer to our AI Policy for additional details.
    2. Currently, ResearchHub processes approximately 170 peer reviews a week (Fig 3a), with 68.7% of all preprints that receive a peer review incentive receiving at least 1 peer review that meets the bar of quality (Fig 3b). Only reviews meeting this minimum bar, as deemed by ResearchHub's editorial team, will be leveraged for this study.
    3. ResearchHub peer reviewers specialize mainly in the biomedical and hard sciences, including molecular biology, immunology, oncology, neuroscience, etc. (Fig 3c, 3d).
  2. Error Detection prompt (a sketch of the corresponding output-structure check follows the prompt):
Please carefully review the following academic paper and identify any errors, issues, inconsistencies, or 
questionable reasoning, with a special focus on the mathematical content. In particular, examine the 
correctness and clarity of all formulas, derivations, and calculations. Check for doubtful hypotheses, shaky
assumptions, incorrect mathematical statements, undefined terms, unjustified steps, missing references for 
key theorems, or discrepancies between the described methods and the results presented. If any part of the 
argument seems incomplete, unclear, or unconvincing, highlight these sections and explain why they might be 
problematic. Additionally, look for subtle errors such as misapplied formulas, overlooked boundary 
conditions, or unsupported assumptions in modeling steps. Make sure to check the figures, captions, and metadata of the paper whenever relevant. Finally, suggest potential ways to strengthen or clarify the 
mathematical aspects of the paper.

{
  "overall_assessment": "<Overall Assessment and Key Takeaways>",
  "mathematical_rigor_and_correctness": {
    "equation_summary": {
      "inline_equations": <number_of_inline_equations>,
      "display_equations": <number_of_display_equations>
    },
   "equation_details" :{
      "inline_equations": <list of identified inline_equations>,
      "display_equations": <list of identified display_equations>,
    },
    "abbreviation_and_symbol_definitions": "<Comment on whether all abbreviations and mathematical symbols are well-defined. 
    Explicitly explain ill defined items.>",
    "sign_and_operation_checks": "<Comment on the correctness of signs and mathematical operations. 
    Explicitly explain what is the issue and where are the errors.>"
  },
  "issues_list": [
    {
      "issue": "<Description of Issue 1>",
      "severity_rating": <numerical_rating_from_1.0_to_10.0>,
      "severity_explanation": "<Explanation of the rating with rationale>",
      "suggestion": "<Suggestion with concrete actionable steps to address this issue>"
    },
    {
      "issue": "<Description of Issue 2>",
      "severity_rating": <numerical_rating_from_1.0_to_10.0>,
      "severity_explanation": "<Explanation of the rating with rationale>",
      "suggestion": "<Suggestion with concrete actionable steps to address this issue>"
    }
    // Repeat for issues 3 to 10
  ]
}

Standardized Scoring Approach for Severity Rating:

1.0 - 2.9: Minor typographical errors or negligible issues that do not affect understanding.
3.0 - 5.9: Moderate issues that may cause some confusion or misunderstandings but are correctable.
6.0 - 8.9: Significant errors that impact the validity of results or the clarity of the paper.
9.0 - 10.0: Critical flaws that fundamentally undermine the paper's conclusions or correctness.
In the issues_list, provide a list of the 10 most important or most glaring mistakes or issues, following the standardized 
scoring approach to maintain objectivity across papers. Be specific to the current paper.

**Ensure that your response is formatted as valid JSON only, without any additional text or commentary.
At all costs, never include ```json fences in your answer; a valid JSON response should start with { and end with }.**
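
Below is a minimal sketch of the kind of structure check the AI QC agent could apply to this output before curation into the raw report (steps 8a-9); the required keys follow the JSON template above, while the checking logic itself is our illustrative assumption.

import json

REQUIRED_TOP_KEYS = {"overall_assessment", "mathematical_rigor_and_correctness", "issues_list"}

def validate_review(raw: str) -> dict:
    """Parse the Error Detection output and enforce the structure requested above."""
    review = json.loads(raw)  # the prompt demands bare JSON with no ```json fences
    missing = REQUIRED_TOP_KEYS - review.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if len(review["issues_list"]) > 10:
        raise ValueError("issues_list should contain at most the 10 most important issues")
    for issue in review["issues_list"]:
        rating = float(issue["severity_rating"])
        if not 1.0 <= rating <= 10.0:  # standardized severity scale defined above
            raise ValueError(f"severity rating out of range: {rating}")
    return review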

Figure 3a. Total Number of Completed Peer Reviews per Week on ResearchHub from May 2022 to present. ResearchHub has housed almost 4000 peer reviews at a rate of roughly 170 peer reviews a week. Blue indicates a concerted effort to promote the peer review process on ResearchHub. 
 

Figure 3b. Total number of manuscripts separated by whether ResearchHub Foundation distributed the bounty (green, 68.7%) or did not distribute the bounty (red, 31.3%). Green indicates a manuscript that received at least 1 peer review meeting the minimum threshold of quality to receive the bounty. Red indicates a manuscript that either received no peer review or received reviews that did not meet the minimum threshold of quality or abide by the stipulations set by ResearchHub Foundation.

Figure 3c. Number of Peer Reviews previously completed on ResearchHub stratified by OpenAlex Subfields


 

Figure 3d. Number of Peer Reviews previously completed on ResearchHub stratified by OpenAlex Topics
 

The Human Feedback component will be primarily driven by human editors/validators who will assess the precision and accuracy of the human and AI reviewers based on the claims they make. The collected human feedback will then be used to benchmark and improve the AI-generated reviews. An example of how this benchmarking can be performed is shown in the recent work ReviewEval by Kumar et al. [4] and the references therein, part of which is already available within the ARTEMIS system under development. With the growth in community interest in open peer review, which ResearchHub has adopted, human validators can quickly view and provide input on the reviews being produced in an open, transparent marketplace [5]. The editors/validators will use a flashcard-style approach for simple validation. The flashcards will appear in several forms depending on the relevant context; typical cases include Fact-Check, Logic-Check, Math-Check, and Expert Confidence. This alternative approach to consolidating expert feedback is designed as a "bite-sized" task that an expert in the field can answer or reason through without needing to read the entire paper. All context necessary for understanding or providing meaningful feedback on the specific excerpt will be provided on the flashcard. Peer reviewers are assumed to be knowledgeable about the particular topic, and their claims will be weighted according to their background/reputation (a weighting sketch follows the list below).

  1. ResearchHub's verification and reputation system: Using OpenAlex, an open repository of scientific data, ResearchHub has ascribed a reputation score to verified users within their various fields of expertise (Fig. 4). This score will be used to value or weight the statements made by peer reviewers and validators.
  2. Reviewers' track records will be checked based on their past performance and any previous editorial feedback. Additionally, each peer reviewer may need to complete a mini-quiz before receiving a tag for specific expertise, ensuring that they actually possess the required expertise. If the reputation system is not sufficient to properly confirm a user's expertise, the mini-quiz will be employed, consisting of single-line, domain-specific questions, each restricted to a 5-10 second countdown for an answer.
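
To illustrate how the reputation weighting could be applied to flashcard verdicts, here is a minimal sketch; the flashcard fields and the percentile-weighting formula are our assumptions, since the exact aggregation rule is not specified above.

from dataclasses import dataclass

@dataclass
class FlashcardVerdict:
    validator_id: str
    card_type: str                 # "Fact-Check", "Logic-Check", "Math-Check", or "Expert Confidence"
    agrees_with_claim: bool        # the validator's bite-sized judgement on the excerpt
    reputation_percentile: float   # 0-100, from ResearchHub's reputation system (Fig. 4)

def weighted_agreement(verdicts: list[FlashcardVerdict]) -> float:
    """Return a 0-1 agreement score, weighting each validator by reputation percentile."""
    total = sum(v.reputation_percentile for v in verdicts)
    if total == 0:
        return 0.0
    return sum(v.reputation_percentile for v in verdicts if v.agrees_with_claim) / total

# Example: two validators disagree; the higher-reputation validator dominates the score.
cards = [FlashcardVerdict("a", "Math-Check", True, 92.0),
         FlashcardVerdict("b", "Math-Check", False, 40.0)]
print(round(weighted_agreement(cards), 2))  # ~0.7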

Figure 4. Example of a verified user account on ResearchHub with a quantitative percentile ranking as a “reputation” based on publication history.
 

Alternative Approach: If reviewers and/or editors prove difficult to source for this work, we can instead validate the AI reviews against other pre-existing datasets [6]. Additionally, if the flashcard approach does not prove to be a good source of validation, we can have validators assess the reviewers using an adapted version of ResearchHub Foundation's standardized unified rubric (not public, but unified recommendations to reviewers can be found here), giving weighted scores on methods, results, discussion, rigor, novelty, and error detection rate. Editors' evaluations will be weighted according to their reputation score in that field under ResearchHub's reputation system.

Aim 2 - Develop Versatile Agentic Solutions for Scientific Auditing

To implement the functions described in steps 5 to 9 (Fig. 2) for AI-agent-based scientific auditing, we propose to develop 3 agentic roles and 5 auxiliary tooling functions. The 3 agentic roles are: Quality Control Agent, AI Reviewer Agent, and AI Editor. Each of these roles may be deployed under different contexts and configurations to achieve the desired performance. For example, this may include an AI Editor in Neuroscience, an AI Reviewer in Neuroscience, an AI Quality Control agent in Neuroscience, etc., parallelized across a plethora of fields of expertise.

The 5 auxiliary toolings are: Long-Term Memory, Knowledge Base, Visual Understanding and Reasoning, Scientific References, and Consistency of Structured Output. These toolings will be designed to be as modular as possible, allowing the tech stack to be replaced when necessary to boost performance and efficiency.
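
As an illustration of this modularity, the sketch below expresses the three agentic roles and five auxiliary toolings as a swappable configuration; all names and the configuration format are our assumptions, not the production design.

from dataclasses import dataclass, field

AUXILIARY_TOOLINGS = ["long_term_memory", "knowledge_base",
                      "visual_understanding_and_reasoning",
                      "scientific_references", "structured_output_consistency"]

@dataclass
class AgentRole:
    role: str                      # "quality_control" | "reviewer" | "editor"
    domain: str                    # e.g. "neuroscience"
    model: str                     # swappable foundation-model backend
    toolings: list = field(default_factory=lambda: list(AUXILIARY_TOOLINGS))

# Parallel deployment of all three roles for one field of expertise (assumed configuration):
fleet = [AgentRole(role, "neuroscience", model="gemini-2.0-flash")
         for role in ("quality_control", "reviewer", "editor")]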

Rigor

Human reviewers can openly review any preprint, allowing for expedited reviews. A standardized prompt for human and AI reviewers, together with standardized rubrics for editors, allows for direct, apples-to-apples comparisons of specific components of the reviews. ResearchHub's verification and reputation v2 system will be employed to weight evaluations according to the expertise of the individual making the claims.

Additionally, prior to this pre-registration, we performed preliminary studies to understand the dynamics, economics, and quality of the incentivized peer review process employed on ResearchHub [7]. All three aspects of open peer review submissions addressed above (dynamics, economics, and quality) depend on at least 9 other coupled factors beyond economic and non-economic incentives (Table 1). Therefore, any correlation found would be contextual and apply only to a particular combination of peer review configurations involving at least these 9 factors. For Aim 1 of this pre-registration, we will fix the peer review context to ensure maximal reproducibility (a configuration sketch follows Table 1).

Table 1. The nine identified factors influencing the correlation between incentives and the quality of open peer reviews of scientific manuscripts.

Each factor influencing the incentive-quality correlation is listed below with its possible variables.

Order of Incentivization
  • Award First (Commissioned Task)
  • Award Later (Rewarded)
  • Bidding Mechanism (Stake for Chance)

Reviewer Pool Size
  • Small [<100]
  • Medium [100-1000]
  • Large [1000+]

Expertise Diversity
  • Ranging from 0 to 1
  • 0: all reviewers are from the exact same research subfield
  • 1: each subfield has an exactly equal number of reviewers
  • Note: each reviewer can have more than 1 area of expertise, but at most 3 notable areas; the complete list of subfields will therefore contain at most 3 x the number of reviewers in the pool.

Reviewer Selection Criteria
  • Single Chief-Editor (or Campaign Host) makes the final call
  • First Come, First Served [Based on Objective Quality Scores or Time Order]
  • Democratic Decision [Community Votes or Expert Scoring]

Type of Incentive and Size/Intensity
  • Monetary
  • Reputation
  • Monetary + Reputation

Requested Review Structure
  • Simple (Yes/No answer and inline comments)
  • Moderate (Structured Comments on Different Aspects)
  • Rigor (Complete Assessment of All Sections, Figures, References, and more)

Difficulty of Review Task
  • Easy (Generally Knowledgeable Public)
  • Medium (Graduate Student Level on the Topic)
  • Hard (Subject Expert on the Topic)

Likelihood & Proportion of AI Usage
  • Low (<10%)
  • Moderate (10-90%)
  • High (>90%)

Time Restriction
  • Intense (<3 Days)
  • Moderate (3-14 Days)
  • Leisurely (>15 Days)
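
To make the fixed peer review context for Aim 1 concrete, the sketch below pins the nine factors from Table 1 as a single configuration object; the specific values shown are illustrative placeholders, not the registered experimental settings.

# Assumed configuration object fixing the nine Table 1 factors for reproducibility.
REVIEW_CONTEXT = {
    "order_of_incentivization": "Award Later (Rewarded)",
    "reviewer_pool_size": "Medium [100-1000]",
    "expertise_diversity": 0.6,               # 0 = single subfield, 1 = uniform across subfields
    "reviewer_selection_criteria": "First Come, First Served",
    "incentive_type_and_size": "Monetary + Reputation",
    "requested_review_structure": "Rigor",
    "review_task_difficulty": "Hard (Subject Expert on the Topic)",
    "ai_usage_proportion": "Moderate (10-90%)",
    "time_restriction": "Moderate (3-14 Days)",
}

assert len(REVIEW_CONTEXT) == 9  # all nine factors are pinned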

Budget

Table 2. Budget description and breakdown for an estimated 12-month timeline.

Development
  • Need: 1) Compute, 2) API, 3) Outsourced developer work for data cleaning and UI/UX design
  • Details: Digital Ocean, o1/Claude/Gemini APIs, GPUs, observability system, auxiliary tools for AI agents, vector storage, etc.
  • Cost: $10,000

Human reviewer costs
  • Need: Incentives for peer reviewers for ResearchHub Foundation selected papers; for YesNoError (YNE) selected papers (50 papers); and for Gareth and Maria's selected papers (50 papers)
  • Details: Preprints in the fields of biology, chemistry, neuroscience, immunology, genetics, molecular biology, biochemistry, oncology, mathematics, physics, computer science, etc.
  • Cost: $150,000 for 1000 human reviews (assuming $150 per peer review and 1 review per preprint)

Human editor costs
  • Need: Incentive for validators (editors or authors) to assess AI and human reviewer outputs
  • Details: Validators will be asked to provide either full or partial feedback on the original preprint and/or the AI review submissions.
  • Cost: $50,000 for 1000 editor assessments (assuming $50 to assess all peer reviews per manuscript)

Administrative cost
  • Need: Marketing
  • Details: Costs associated with marketing efforts to ensure campaign visibility and sustain community interest in contributing.
  • Cost: $5,000

Reserve fund
  • Need: For unexpected needs and as insurance against volatility.
  • Cost: $15,000; if unused, this will be put toward obtaining a larger dataset.


Disclosure and Licensing Statement

This document outlines the terms under which the content, outputs, and tools resulting from our study on AI reviewers for scientific manuscripts are shared and utilized. Transparency and collaboration are core to our research philosophy, and we aim to strike a balance between open science principles and the potential for practical, real-world applications.

Open Source Commitment

  1. Content and Outputs:

    • All data, algorithms, models, and other outputs derived from this study will be made available under an open-source license, allowing unrestricted access for academic, research, and non-commercial purposes.
    • The specific license will be the MIT License to ensure clarity regarding permissible uses and attribution requirements.
  2. Repository Access:

    • Public repositories (e.g., on GitHub, Zenodo, HuggingFace or similar platforms) will host the content and outputs, accompanied by detailed documentation to facilitate their reuse and extension by the broader research community.

Commercialization Clause

While the content and outputs of this study will be open-sourced, we reserve the right to commercialize tools or technologies derived from these resources. 

By participating in this study or utilizing its outputs, users acknowledge and agree to the terms outlined in this disclosure and licensing statement.

Declaration of Generative AI in Scientific Writing

During the preparation of this work, ChatGPT was used to enhance the readability and language of some parts of the manuscript. Additionally, it was used to consolidate the original content written in the introduction, methods, and preliminary data to construct an Abstract. Following any ChatGPT usage, content was reviewed and edited as needed.

Acknowledgements

We would like to acknowledge Matt Schlicht, founder of YesNoError, for being a Strategic Data Partner, and Gareth Dyke & Maria Machado as Strategic Expert Community Partners.

References

[1] Howard Bauchner, Frederick P Rivara, Use of artificial intelligence and the future of peer review, Health Affairs Scholar, Volume 2, Issue 5, May 2024, qxae058, https://doi.org/10.1093/haschl/qxae058

[2] Hanson, M. A., Gómez Barreiro, P., Crosetto, P., & Brockington, D. (2024). The strain on scientific publishing. arXiv. https://arxiv.org/abs/2309.15884v2

[3] Seghier, M.L. (2025), "AI-powered peer review needs human supervision", Journal of Information, Communication and Ethics in Society, Vol. 23 No. 1, pp. 104-116. https://doi.org/10.1108/JICES-09-2024-0132

[4] Kirtani, C., Garg, M. K., Prasad, T., Singhal, T., Mandal, M., & Kumar, D. (2025). ReviewEval: An evaluation framework for AI-generated reviews. arXiv. https://arxiv.org/abs/2502.11736v2

[5] Yang, J. (2025). Paper Copilot: The artificial intelligence and machine learning community should adopt a more transparent and regulated peer review process. arXiv. https://arxiv.org/abs/2502.00874

[6] Kuznetsov, I., Gurevych, I., et al. (2024). What can natural language processing do for peer review? arXiv. https://arxiv.org/abs/2405.06563. GitHub repository: Afzal, O. M. (n.d.). nlp-for-peer-review. https://github.com/OAfzal/nlp-for-peer-review

[7] Brian, D. (2024). Incentivized vs non-incentivized open peer reviews: Dynamics, economics, and quality. ResearchHub. https://doi.org/10.55277/ResearchHub.taescjxh
