WaDec: Decompiling WebAssembly Using Large Language Model


Background

Previous WASM decompilers produce output with poor readability.
Existing LLM-based decompilers are not specialized for WASM binaries.
- The stack-based architecture makes the assembly very hard to comprehend.
- They cannot handle fine-grained snippets.
- They cannot handle complicated structures, including nested loops.

Method

1. Dataset construction

Collected about 52,000 C programs.
Each sample consists of (WAT, C, Spatial_info, Temporal_info, Offset2String).
Main idea: String Substitution -> Variable Renaming -> Snippet Slicing

WAT (WebAssembly Text Format)	Snippet in the human-readable text format of the WASM binary
C	Snippet of original source code
Spatial_info	Function signature (parameters & return values)
Temporal_info	Local variables defined before the snippet
Offset2String	Mapping from offsets to string constants
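
To make the five fields concrete, one sample could be held in a record like the one below. This is only a sketch: the class name `Sample`, the field types, and the example values are assumptions chosen to mirror the table, not the authors' actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Hypothetical container for one training sample."""
    wat: str            # WAT snippet
    c: str              # matching C snippet
    spatial_info: str   # function signature
    # Local variables defined before the snippet.
    temporal_info: list = field(default_factory=list)
    # Data-segment offset -> string constant.
    offset2string: dict = field(default_factory=dict)


# Illustrative values only; they are not taken from the paper's dataset.
sample = Sample(
    wat="(call $puts (i32.const 1024))",
    c="puts(1024);",
    spatial_info="int main(void)",
    temporal_info=["int local_0"],
    offset2string={1024: "hello"},
)
```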

String Substitution

String substitution enables the model to recover string values correctly.
Every string literal is replaced with its offset in the data segment (the mapping is kept in ‘Offset2String’).
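
A minimal sketch of the substitution step on the C side. It assumes the offset of each literal is already known (for example, read from the module's data segment); the function name `substitute_strings` and the `string_offsets` argument are made up for illustration.

```python
import re

# A double-quoted C string literal, allowing escaped characters.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')


def substitute_strings(c_code, string_offsets):
    """Replace each known string literal with its data-segment offset.

    string_offsets maps literal contents to an offset (assumed input).
    Returns the rewritten code and the Offset2String mapping.
    """
    offset2string = {}

    def replace(match):
        text = match.group(0)[1:-1]  # strip the surrounding quotes
        if text not in string_offsets:
            return match.group(0)    # unknown literal: leave untouched
        offset = string_offsets[text]
        offset2string[offset] = text
        return str(offset)

    return STRING_LITERAL.sub(replace, c_code), offset2string


code, mapping = substitute_strings('printf("hello");', {"hello": 1024})
```

Here `code` becomes `printf(1024);` and `mapping` records that offset 1024 stands for `hello`, so the original literal can be restored after decompilation.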

Variable Renaming

Variable renaming improves the consistency in recovered variable names.
Variables in both the source code and the WAT snippets are renamed to ‘local_N’.
Each variable name is mapped to a specific offset within its function using DWARF information.
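
The renaming on the C side can be sketched as follows. The name-to-index mapping is assumed to come from DWARF and is simply passed in here; `rename_variables` is a hypothetical helper. Note that this naive regex version would also rewrite matching words inside string literals or comments.

```python
import re


def rename_variables(c_code, name_to_index):
    """Rename each known variable to local_<index> using whole-word matches."""
    if not name_to_index:
        return c_code
    names = "|".join(re.escape(name) for name in name_to_index)
    pattern = re.compile(r"\b(" + names + r")\b")
    return pattern.sub(lambda m: f"local_{name_to_index[m.group(1)]}", c_code)


renamed = rename_variables("int sum = 0; sum += count;", {"sum": 2, "count": 3})
```

With this input, `renamed` is `int local_2 = 0; local_2 += local_3;`.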

Slicing

Slicing improves the model's ability to handle nested loops.
Code is sliced so that each snippet includes at most one loop statement.
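
The "at most one loop per snippet" rule can be illustrated with a greedy grouping over statements that are already split at the top level. This is a heavy simplification and not the authors' slicer: it does not parse C, and it does not descend into loop bodies, which a real slicer would need to do for nested loops.

```python
import re

# A statement counts as a loop if it starts with a loop keyword.
LOOP_KEYWORD = re.compile(r"\b(for|while|do)\b")


def slice_statements(statements):
    """Greedily group statements so each slice holds at most one loop."""
    slices, current, has_loop = [], [], False
    for stmt in statements:
        is_loop = bool(LOOP_KEYWORD.match(stmt.lstrip()))
        if is_loop and has_loop:
            # A second loop would land in this slice: start a new one.
            slices.append(current)
            current, has_loop = [], False
        current.append(stmt)
        has_loop = has_loop or is_loop
    if current:
        slices.append(current)
    return slices


body = [
    "int i = 0;",
    "for (i = 0; i < n; i++) { a[i] = 0; }",
    "int s = 0;",
    "while (s < n) { s += 2; }",
    "return s;",
]
snippets = slice_statements(body)
```

For `body` above, the result is two slices: the first ends with `int s = 0;`, and the second starts at the `while` loop.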

2. Modeling

The model is fine-tuned from CodeLlama-7b-hf.
Training objective: next-token prediction (CAUSAL_LM task type).
A prompt (p_n) is synthesized for each sample to train the model.
- In the evaluation prompt, c_n is excluded.

i	Instruction (i.e. You are a decompiler…)
Vb_n	Previously defined variables
d_n	Function signatures
w_n	WAT code
c_n	C code
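
A sketch of how p_n could be assembled from the parts in the table. The section labels and their layout are assumptions, since the exact template is not reproduced here; the only behavior taken from the notes is that c_n is present for training and left out for evaluation.

```python
def build_prompt(instruction, variables, signature, wat, c_code=None):
    """Assemble a prompt from i, Vb_n, d_n, w_n and, for training, c_n."""
    parts = [
        instruction,
        "Previously defined variables:\n" + "\n".join(variables),
        "Function signature:\n" + signature,
        "WAT:\n" + wat,
        "C:\n",
    ]
    prompt = "\n\n".join(parts)
    if c_code is not None:
        prompt += c_code  # training prompt: the target C code is appended
    return prompt


# Illustrative inputs only.
common = dict(
    instruction="You are a decompiler.",
    variables=["int local_0"],
    signature="int main(void)",
    wat="(local.set 0 (i32.const 1))",
)
train_prompt = build_prompt(c_code="local_0 = 1;", **common)
eval_prompt = build_prompt(**common)
```

The evaluation prompt stops right after the `C:` label, so the model's continuation is the decompiled snippet.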

Evaluation

Similarity

Evaluated with the test set included in the dataset. About 4,000 C programs were used.
Overall, the decompiled code shows high similarity to the original source code.
Metrics
- C@Func = (# functions in decompiled code)/(# total functions)

AED-S (Normalized AST edit distance)	AST similarity (higher is better)
CCN (Cyclomatic complexity)	Program complexity (higher is better)
COS (Cosine similarity)	Token similarity (higher is better)
CodeBLEU	Structural and data-flow similarity at the AST level (higher is better)
C@Func (Function completeness)	Ratio of functions recovered (higher is better)
Bloat Rate	Expansion in code lines (lower is better)
C@Syntax (Syntax completeness)	Ratio of syntactically flawless statements (higher is better)
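
Three of the simpler metrics can be sketched in a few lines. C@Func follows the ratio given above; the bloat-rate formula (relative growth in line count) and the bag-of-tokens cosine similarity are assumptions about how these metrics are typically computed, not definitions quoted from the paper.

```python
import math
from collections import Counter


def function_completeness(recovered_functions, total_functions):
    """C@Func: share of functions present in the decompiled code."""
    return recovered_functions / total_functions


def bloat_rate(decompiled_lines, original_lines):
    """Relative growth in line count (assumed definition)."""
    return (decompiled_lines - original_lines) / original_lines


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-tokens count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


score = cosine_similarity("int x = 0 ;".split(), "int y = 0 ;".split())
```

In the example the two token lists share four of their five tokens, so `score` is 0.8.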

Re-executability

The recompilation and re-execution rates are acceptable.
Output consistency is low.

Discussion

Variable renaming in the aspect of readability

Variable renaming may be effective for consistency.
However, it can hurt the readability of the decompiled code, since names such as ‘local_N’ carry no meaning.
This might be improved by post-processing the variable names with an LLM.

CodeBLEU score

The CodeBLEU score was 0.6353, which is relatively low.
The authors argue that CodeBLEU cannot accurately measure structural similarity.
- This might be due to the rich semantics of the C language.

Consistency

The experiments show that the outputs of the recompiled binaries are often inconsistent with those of the originals.
This may be hard to resolve for now, as it is a fundamental limitation of language models.

Evaluated only for non-optimized binaries

Higher optimization levels may lower the model's accuracy.
