ResearchHub | Open Science Community

Xiangzhe Xu

Author with expertise in Characterization and Detection of Android Malware

Key Stats

Upvotes received: 0

Publications: 5 (40% Open Access)

Cited by: 0

h-index: 5 / i10-index: 2

Reputation

Biology: < 1%

Chemistry: < 1%

Economics: < 1%

Publications

Sanitizing Large Language Models in Bug Detection with Data-Flow

Chengpeng Wang et al., Jan 1, 2024

Artificial Intelligence

Signal Processing

CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking

Zian Su et al., Jul 12, 2024

Transformer-based code models perform impressively in many software engineering tasks. However, their effectiveness degrades when symbols are missing or uninformative, because without the help of symbols the model may not learn to attend to the right correlations and contexts. We propose a new method to pre-train general code models when symbols are lacking. We observe that in such cases, programs degenerate to something written in a very primitive language. We hence propose to use program analysis to extract contexts a priori (instead of relying on symbols and masked language modeling as in vanilla models). We then leverage a novel attention masking method that allows the model to attend only to these contexts, e.g., bi-directional program dependence transitive closures and token co-occurrences. Meanwhile, the inherent self-attention mechanism is used to learn which of the allowed attentions are more important than others. To realize the idea, we enhance the vanilla tokenization and model architecture of a BERT model, construct and utilize attention masks, and introduce a new pre-training algorithm. We pre-train this BERT-like model from scratch, using a dataset of 26 million stripped binary functions with explicit program dependence information extracted by our tool. We apply the model to three downstream tasks: binary similarity, type inference, and malware family classification. Our pre-trained model improves the state of the art in these tasks from 53% to 64%, 49% to 60%, and 74% to 94%, respectively. It also substantially outperforms other general pre-training techniques for code understanding models.
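
The attention-masking idea in the abstract above can be illustrated with a minimal sketch. It is not the CodeArt implementation: it assumes each position is a single instruction, that dependence edges are already given as index pairs, and it ignores token co-occurrence entirely. The names `transitive_closure` and `dependence_attention_mask` are hypothetical.

```python
def transitive_closure(n, edges):
    """Return reach[i][j] == True when j is reachable from i along the edges."""
    reach = [[False] * n for _ in range(n)]
    for src, dst in edges:
        reach[src][dst] = True
    # Warshall's algorithm: allow paths that pass through intermediate node k.
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


def dependence_attention_mask(n, edges):
    """Build a boolean mask: mask[i][j] is True when i may attend to j.

    Assumption for this sketch: a position may attend to itself and to any
    position related to it through the dependence closure, in either direction.
    """
    reach = transitive_closure(n, edges)
    return [
        [i == j or reach[i][j] or reach[j][i] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical example: instruction 0 -> 1 -> 2, instruction 3 is unrelated.
mask = dependence_attention_mask(4, [(0, 1), (1, 2)])
```

In this example, positions 0 and 2 may attend to each other through the transitive dependence, while position 3 attends only to itself. In a real model, such a mask would typically be turned into additive `-inf` entries on the attention scores.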

Artificial Intelligence

Theoretical Computer Science

OdScan: Backdoor Scanning for Object Detection Models

Siyuan Cheng et al., May 19, 2024

Artificial Intelligence

Industrial And Manufacturing Engineering

Lotus: Evasive and Resilient Backdoor Attacks through Sub-Partitioning

Siyuan Cheng et al., Jun 16, 2024

Artificial Intelligence

Signal Processing

ReSym: Harnessing LLMs to Recover Variable and Data Structure Symbols from Stripped Binaries

Dongji Xie et al., Dec 2, 2024

Decompilation aims to recover a binary executable to the source code form and hence has a wide range of applications in cyber security, such as malware analysis and legacy code hardening. A prominent challenge is to recover variable symbols, including both primitive and complex types such as user-defined data structures, along with their symbol information such as names and types. Existing efforts focus on solving parts of the problem, e.g., recovering only types (without names) or only local variables (without user-defined structures). In this paper, we propose ReSym, a novel hybrid technique that combines Large Language Models (LLMs) and program analysis to recover both names and types for local variables and user-defined data structures. Our method encompasses fine-tuning two LLMs to handle local variables and structures, respectively. To overcome the token limitations inherent in current LLMs, we devise a novel Prolog-based algorithm to aggregate and cross-check results from multiple LLM queries, suppressing uncertainty and hallucinations. Our experiments show that ReSym is effective in recovering variable information and user-defined data structures, substantially outperforming the state-of-the-art methods.
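
The step of aggregating and cross-checking results from multiple LLM queries can be pictured with a much simpler stand-in. ReSym uses a Prolog-based algorithm whose details are not given here; the sketch below substitutes a plain majority vote and is only meant to show the general shape of the idea. The function name `aggregate_predictions`, the variable identifiers, and the sample predictions are all hypothetical.

```python
from collections import Counter


def aggregate_predictions(predictions):
    """Combine per-query predictions by strict majority vote.

    predictions: list of dicts mapping a variable id to a (name, type) pair.
    Returns a dict mapping each variable id to the winning pair, or None when
    no pair is backed by more than half of the queries that mention it.
    """
    votes = {}
    for prediction in predictions:
        for var, label in prediction.items():
            votes.setdefault(var, Counter())[label] += 1

    result = {}
    for var, counter in votes.items():
        label, count = counter.most_common(1)[0]
        total = sum(counter.values())
        # Keep a label only if it wins a strict majority; otherwise mark the
        # variable as unresolved rather than guessing.
        result[var] = label if count * 2 > total else None
    return result


# Hypothetical outputs of three separate queries over overlapping code chunks.
queries = [
    {"v1": ("buf", "char*"), "v2": ("len", "int")},
    {"v1": ("buf", "char*"), "v2": ("size", "size_t")},
    {"v1": ("ptr", "void*")},
]
merged = aggregate_predictions(queries)
```

Here `v1` resolves to `("buf", "char*")` because two of three queries agree, while `v2` stays unresolved because its two candidate labels tie.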

Artificial Intelligence

Information Systems
