ResearchHub | Open Science Community

Text Classification Algorithms: A Survey

Kamran Kowsari et al., Apr 23, 2019

In recent years, there has been an exponential growth in the number of complex documents and texts that require a deeper understanding of machine learning methods to be able to accurately classify texts in many applications. Many machine learning approaches have achieved surpassing results in natural language processing. The success of these learning algorithms relies on their capacity to understand complex models and non-linear relationships within data. However, finding suitable structures, architectures, and techniques for text classification is a challenge for researchers. In this paper, a brief overview of text classification algorithms is discussed. This overview covers different text feature extractions, dimensionality reduction methods, existing algorithms and techniques, and evaluations methods. Finally, the limitations of each technique and their application in real-world problems are discussed.

Artificial Intelligence

Information Systems

The Declining Risk of Post-Transfusion Hepatitis C Virus Infection

James Donahue et al., Aug 6, 1992

The most common serious complication of blood transfusion is post-transfusion hepatitis from the hepatitis C virus (HCV). Blood banks now screen blood donors for surrogate markers of non-A, non-B hepatitis and antibodies to HCV, but the current risk of post-transfusion hepatitis C is unknown.

Epidemiology

Immunology

Urban Freeway Traffic Flow Prediction: Application of Seasonal Autoregressive Integrated Moving Average and Exponential Smoothing Models

Billy Williams et al., Jan 1, 1998

The application of seasonal time series models to the single-interval traffic flow forecasting problem for urban freeways is addressed. Seasonal time series approaches have not been used in previous forecasting research. However, time series of traffic flow data are characterized by definite periodic cycles. Seasonal autoregressive integrated moving average (ARIMA) and Winters exponential smoothing models were developed and tested on data sets belonging to two sites: Telegraph Road and the Woodrow Wilson Bridge on the inner and outer loops of the Capital Beltway in northern Virginia. Data were 15-min flow rates and were the same as used in prior forecasting research by B. Smith. Direct comparisons with the Smith report findings were made and it was found that ARIMA(2, 0, 1)(0, 1, 1)₉₆ and ARIMA(1, 0, 1)(0, 1, 1)₉₆ were the best-fit models for the Telegraph Road and Wilson Bridge sites, respectively. Best-fit Winters exponential smoothing models were also developed for each site. The single-step forecasting results indicate that seasonal ARIMA models outperform the nearest-neighbor, neural network, and historical average models as reported by Smith.
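The seasonal period of 96 in the models above matches the number of 15-min intervals in one day (96 × 15 min = 24 h), and the seasonal order (0, 1, 1) includes one seasonal difference. The following is a minimal sketch of that differencing step only, not the authors' fitted models. The function names and the toy flow values are hypothetical and chosen purely for illustration.

```python
def seasonal_difference(series, period=96):
    """Return y[t] - y[t - period] for every t >= period."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


def seasonal_naive_forecast(series, period=96):
    """Forecast the next value as the observation one full season earlier."""
    if len(series) < period:
        raise ValueError("need at least one full season of history")
    return series[-period]


# Hypothetical toy data (not from the paper): two "days" of 96 readings,
# where the second day repeats the first with a constant shift of 5.
day = [100 + (i % 24) * 10 for i in range(96)]
flows = day + [v + 5 for v in day]

diffs = seasonal_difference(flows)          # daily cycle removed, shift remains
next_flow = seasonal_naive_forecast(flows)  # value from the same slot one day back
```

After differencing, the repeating daily pattern cancels and only the day-to-day change is left, which is the part the non-seasonal ARMA terms would then model.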

Control And Systems Engineering

Building And Construction

Cross-cultural psychology: Research and applications

Donald Brown, Jan 1, 1996

Applied Psychology

Social Psychology

Joint representation learning for retrieval and annotation of genomic interval sets

Erfaneh Gharavi et al., Aug 22, 2023

Motivation: As available genomic interval data increases in scale, we require fast systems to search it. A common approach is simple string matching to compare a search term to metadata, but this is limited by incomplete or inaccurate annotations. An alternative is to compare data directly through genomic region overlap analysis, but these approaches lead to challenges like sparsity, high dimensionality, and computational expense. We require novel methods to quickly and flexibly query large, messy genomic interval databases. Results: Here, we develop a genomic interval search system using representation learning. We train numerical embeddings for a collection of region sets simultaneously with their metadata labels, capturing similarity between region sets and their metadata in a low-dimensional space. Using these learned co-embeddings, we develop a system that solves three related information retrieval tasks using embedding distance computations: retrieving region sets related to a user query string; suggesting new labels for database region sets; and retrieving database region sets similar to a query region set. We evaluate these use cases and show that jointly learned representations of region sets and metadata are a promising approach for fast, flexible, and accurate genomic region information retrieval.
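The three retrieval tasks in this abstract all reduce to nearest-neighbor lookups once labels and region sets share one embedding space. The sketch below illustrates that idea under stated assumptions: cosine similarity stands in for the unspecified "embedding distance", and every vector, label, and set name is a hypothetical toy value rather than output of the authors' system.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_vec, candidates, k=1):
    """Return the names of the k candidates most similar to query_vec."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical co-embeddings: labels and region sets in the same 3-D space.
label_embeddings = {"liver": [0.9, 0.1, 0.0], "T cell": [0.0, 0.2, 0.9]}
region_set_embeddings = {
    "set_A": [0.8, 0.2, 0.1],
    "set_B": [0.1, 0.1, 0.95],
    "set_C": [0.7, 0.3, 0.0],
}

# Task 1: region sets related to a query label.
hits = nearest(label_embeddings["liver"], region_set_embeddings, k=2)

# Task 2: suggest a label for a database region set.
suggested = nearest(region_set_embeddings["set_B"], label_embeddings)

# Task 3: region sets similar to a query region set (excluding itself).
others = {n: v for n, v in region_set_embeddings.items() if n != "set_A"}
similar = nearest(region_set_embeddings["set_A"], others)
```

Because all three tasks use the same lookup, the only thing that changes between them is which collection supplies the query and which supplies the candidates.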

Artificial Intelligence

Law

Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings

Nathan LeRoy et al., Jul 2, 2024

Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) are now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach by building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. We implemented our approach in scEmbed, an unsupervised machine-learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed performs well in terms of clustering ability and has the key advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, models pre-trained on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is implemented in Python and it is available to download from GitHub. We also make our pre-trained models available on huggingface for public use. scEmbed is open source and available at https://github.com/databio/geniml. Pre-trained models from this work can be obtained on huggingface: https://huggingface.co/databio.

Artificial Intelligence

Molecular Biology

Embeddings of genomic region sets capture rich biological associations in lower dimensions

Erfaneh Gharavi et al., May 9, 2021

Motivation: Genomic region sets summarize functional genomics data and define locations of interest in the genome such as regulatory regions or transcription factor binding sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis. Results: We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. We compared our approach to two simpler methods based on interval unions or term frequency-inverse document frequency and evaluated the methods in three ways: First, by classifying the cell line, antibody, or tissue type of the region set; second, by assessing whether similarity among embeddings can reflect simulated random perturbations of genomic regions; and third, by testing robustness of the proposed representations to different signal thresholds for calling peaks. Our word2vec-based region set embeddings reduce dimensionality from more than a hundred thousand to 100 without significant loss in classification performance. The vector representation could identify cell line, antibody, and tissue type with over 90% accuracy. We also found that the vectors could quantitatively summarize simulated random perturbations to region sets and are more robust to subsampling the data derived from different peak calling thresholds. Our evaluations demonstrate that the vectors retain useful biological information in relatively lower-dimensional spaces. We propose that vector representation of region sets is a promising approach for efficient analysis of genomic region data. Availability: https://github.com/databio/regionset-embedding
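To make the dimensionality point concrete, here is a rough sketch of a term frequency-inverse document frequency style baseline, treating each region set as a "document" whose "terms" are regions. It is an assumption-laden illustration, not the authors' implementation: the binary term frequency, the plain log(N / df) weighting, the function name, and the toy region identifiers are all hypothetical.

```python
import math


def tfidf_vectors(region_sets):
    """Map each region set to a vector with one entry per distinct region.

    Uses binary term frequency and idf = log(N / document_frequency).
    """
    vocab = sorted({r for regions in region_sets.values() for r in regions})
    n_sets = len(region_sets)
    doc_freq = {
        r: sum(1 for regions in region_sets.values() if r in regions)
        for r in vocab
    }
    vectors = {}
    for name, regions in region_sets.items():
        present = set(regions)
        vectors[name] = [
            math.log(n_sets / doc_freq[r]) if r in present else 0.0
            for r in vocab
        ]
    return vocab, vectors


# Hypothetical toy region sets (identifiers are illustrative only).
region_sets = {
    "s1": ["chr1:100-200", "chr1:500-600"],
    "s2": ["chr1:100-200", "chr2:50-150"],
    "s3": ["chr1:100-200"],
}
vocab, vectors = tfidf_vectors(region_sets)
```

Each vector has as many entries as there are distinct regions, so a real collection produces very wide, sparse vectors. That width is what a fixed 100-dimensional embedding avoids. Note also that a region shared by every set gets weight zero here, since it carries no discriminating information.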

Genetics

Artificial Intelligence

Racial and Ethnic and Rural Variations in the Use of Hybrid Prenatal Care in the US

Peiyin Hung et al., Dec 6, 2024

Importance: Understanding whether there are racial and ethnic and residential disparities in prenatal telehealth uptake is necessary for ensuring equitable access and guiding implementation of future hybrid (ie, both telehealth and in-person) prenatal care.
Objective: To assess temporal changes in individuals using hybrid prenatal care before and during the COVID-19 public health emergency (PHE) by race and ethnicity and residence location in the US.
Design, Setting, and Participants: This retrospective cohort study analyzed electronic health record data of prenatal care visits from the National COVID Cohort Collaborative Data Enclave, comprising data from 75 health systems and freestanding institutes in all 50 US states. Data were analyzed on 349 682 nationwide pregnancies among 349 524 people who gave birth from June 1, 2018, through May 31, 2022. Multivariable generalized estimating equations were used to examine variations in receiving hybrid vs only in-person prenatal care. Data phenotyping and analysis occurred from June 13, 2023, to September 27, 2024.
Exposures: Prenatal period overlap (never, partially, or fully overlapping) with the COVID-19 PHE, maternal race and ethnicity, and urban or rural residence.
Main Outcomes and Measures: Hybrid vs in-person–only prenatal care.
Results: Of 349 682 pregnancies (mean [SD] age, 29.4 [5.9] years), 59 837 (17.1%) were in Hispanic or Latino individuals, 14 803 (4.2%) in non-Hispanic Asian individuals, 65 571 (18.8%) in non-Hispanic Black individuals, 162 677 (46.5%) in non-Hispanic White individuals, and 46 794 (13.4%) in non-Hispanic individuals from other racial and ethnic groups. A total of 31 011 participants (8.9%) resided in rural communities. Hybrid prenatal care increased from nearly none before March 2020 to a peak of 8.1% telehealth visits in November 2020, decreasing slightly to 6.2% by March 2022. Among the fully overlapping group, urban residents had nearly 2-fold odds of hybrid prenatal care compared with rural people (adjusted odds ratio [AOR], 1.98; 95% CI, 1.84-2.12). Hispanic or Latino people (AOR, 1.48; 95% CI, 1.41-1.56), non-Hispanic Asian people (AOR, 1.47; 95% CI, 1.35-1.59), and non-Hispanic Black people (AOR, 1.18; 95% CI, 1.12-1.24) were more likely to receive hybrid prenatal care than non-Hispanic White people.
Conclusions and Relevance: In this cohort study, hybrid prenatal care increased substantially during the COVID-19 PHE, but pregnant people living in rural areas had lower levels of hybrid care than urban people, and individuals who belonged to racial and ethnic minority groups were more likely to have hybrid care than White individuals. These findings suggest that strategies that improve equitable access to telehealth for people who live in rural areas and people in some minority racial and ethnic groups may be useful.

Anthropology

Internal Medicine

Diffusion and Multi-Domain Adaptation Methods for Eosinophil Segmentation

Kevin Lin et al., Mar 12, 2024

Eosinophilic Esophagitis (EoE) represents a challenging condition for medical providers today. The cause is currently unknown, the impact on a patient's daily life is significant, and it is increasing in prevalence. Traditional approaches for medical image diagnosis such as standard deep learning algorithms are limited by the relatively small amount of data and difficulty in generalization. As a response, two methods have arisen that seem to perform well: Diffusion and Multi-Domain methods with current research efforts favoring diffusion methods. For the EoE dataset, we discovered that a Multi-Domain Adversarial Network outperformed a Diffusion based method with a FID of 42.56 compared to 50.65. Future work with diffusion methods should include a comparison with Multi-Domain adaptation methods to ensure that the best performance is achieved.
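FID (Fréchet Inception Distance) compares the feature statistics of two image sets, and lower values indicate closer distributions, so 42.56 is the better of the two scores reported above. The sketch below shows the underlying Fréchet distance between two Gaussians under a simplifying assumption: covariances are treated as diagonal, which reduces the formula to a per-dimension sum of squared mean gaps and squared standard-deviation gaps. Standard FID uses full covariance matrices of Inception network features, so this is not the metric the authors computed. The function name and the feature values are hypothetical.

```python
import math
from statistics import fmean, variance


def frechet_distance_diagonal(features_a, features_b):
    """Fréchet distance between two Gaussians fitted per dimension.

    Assumes diagonal covariances, so each dimension contributes
    (mean_a - mean_b)^2 + (std_a - std_b)^2.
    """
    dims = len(features_a[0])
    total = 0.0
    for d in range(dims):
        col_a = [row[d] for row in features_a]
        col_b = [row[d] for row in features_b]
        mean_gap = fmean(col_a) - fmean(col_b)
        std_gap = math.sqrt(variance(col_a)) - math.sqrt(variance(col_b))
        total += mean_gap ** 2 + std_gap ** 2
    return total


# Hypothetical 2-D feature vectors standing in for image features.
real = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
shifted = [[x + 1.0, y] for x, y in real]  # same spread, mean moved by 1

same_score = frechet_distance_diagonal(real, real)
shifted_score = frechet_distance_diagonal(real, shifted)
```

Identical feature sets score zero, and shifting one dimension's mean by 1 while keeping its spread raises the score by 1, which shows how the metric penalizes a generator whose outputs drift away from the real data.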

Artificial Intelligence

Immunology

Automatic Report Generation for Histopathology Images Using Pre-Trained Vision Transformers and BERT

Saurav Sengupta et al., May 27, 2024

Artificial Intelligence

Molecular Biology
