From Pixels to Wireframes: 3D Reconstruction via CLIP-Based Sketch Abstraction

Swiss Federal Institute of Technology Lausanne (EPFL)
Ground-truth rose and its 3D abstract sketch.

Abstract

Sketch abstraction captures objects using minimal yet expressive visual elements. While recent advances have enabled the generation of 2D sketches from images, such representations remain limited to flat, two-dimensional views. In this project, we introduce a novel extension of sketch abstraction into 3D by optimizing Bézier curves using a constrained Gaussian splatting approach. We specifically restrict Gaussians to matte, single-color, spherical blobs that are placed strictly along Bézier curve paths. This constraint enforces a structured and interpretable wireframe representation while preserving the essence of visual abstraction. Our method generates 3D wireframe-like sketches that serve as compact and meaningful representations of objects, demonstrating strong qualitative results.

I. Introduction

Sketch-based abstraction serves as a powerful mechanism for compactly encoding the essential structure and semantics of objects. While recent developments such as CLIPasso [1] have demonstrated the effectiveness of semantic supervision via CLIP [2] for generating 2D Bézier-based sketches from images, these methods are inherently restricted to two-dimensional representations. As such, they fail to capture the spatial consistency and geometric coherence required for applications involving 3D perception, modeling, and interaction.

This work addresses the extension of sketch abstraction to three dimensions by proposing a fully differentiable framework for 3D sketch reconstruction. Specifically, we introduce a method that optimizes a set of 3D Bézier curves, discretized via spherical Gaussian primitives, to form structured and semantically meaningful wireframe representations. Each curve is modeled as a continuous path sampled into isotropic Gaussian kernels, constrained to lie in 3D space, and rendered through a differentiable rasterization pipeline.

To supervise the optimization process, we leverage a composite CLIP-based objective comprising both global semantic alignment, captured via the final projection layer of a pretrained CLIP model, and localized geometric correspondence, derived from intermediate convolutional activations or attention maps. The optimization proceeds by iteratively projecting the 3D sketch into 2D views, computing perceptual similarity with reference RGB images, and backpropagating through the rendering pipeline to refine the control points of the Bézier curves.

By constraining the sketch abstraction problem to a restricted instance of differentiable Gaussian splatting [3], in which the means are fixed along Bézier paths and the variances and colors remain constant, we obtain a compact and interpretable 3D representation. This formulation enables efficient optimization, multi-view supervision, and strong semantic fidelity without requiring explicit 3D ground truth. The resulting sketch abstractions demonstrate both structural coherence and semantic alignment across a variety of synthetic object datasets.

II. Related Work

The task of generating semantically meaningful 3D sketch abstractions has gained recent attention through works such as 3Doodle [4] and Diff3DS [5], both of which propose novel pipelines for constructing view-consistent 3D stroke representations. These approaches differ from ours in their design choices: 3Doodle performs neural abstraction from mesh data via 3D stroke optimization, while Diff3DS combines differentiable curve rendering with Score Distillation Sampling (SDS), which distills a pretrained text-to-image or image-to-image diffusion model and is commonly used in the 3D domain. These methods validate the feasibility of 3D sketch generation using neural optimization pipelines under perceptual or geometric constraints.

In contrast, our approach formulates the problem as a constrained instance of Gaussian Splatting, wherein each 3D Bézier curve is discretized into a series of spherical Gaussians with fixed appearance parameters. These primitives are optimized under CLIP-based semantic and geometric losses, enabling an interpretable and differentiable 3D sketch abstraction framework without relying on explicit depth supervision or mesh inputs. Our work builds on two key prior directions: (i) CLIP-driven 2D sketch abstraction and (ii) real-time differentiable rendering with Gaussian primitives, discussed in detail below.


II-A. CLIPasso: Semantically-Aware Object Sketching [1]

CLIPasso introduced a novel sketch abstraction pipeline driven by CLIP embeddings. It optimizes 2D Bézier curves to semantically match target images, generating vector sketches that maintain both recognizability and abstraction. CLIP-based supervision enables sketching without pixel-wise supervision, focusing instead on perceptual similarity.

Sketches of objects obtained from 2D images using CLIPasso.

However, CLIPasso operates purely in 2D and does not generalize to multiview-consistent or 3D-aware sketching tasks.

II-B. 3D Gaussian Splatting for Real-Time Radiance Field Rendering [3]

This work replaces dense volumetric representations with a sparse set of view-adaptive anisotropic Gaussians, achieving high-quality and real-time radiance field rendering. The technique offers a fully differentiable rendering pipeline ideal for interactive and fast applications.

Representing 3D scenes using Gaussians.

While highly efficient, this work focuses on photorealistic synthesis and lacks semantic constraints. Our work adapts its rasterization foundation for use with sparse Bézier-based sketches guided by perceptual objectives.


Combining the semantic sketching approach of CLIPasso with the real-time rendering capabilities of Gaussian Splatting, our framework introduces a differentiable 3D sketch representation optimized using CLIP-based supervision. This positions our work uniquely at the intersection of 3D vision, neural rendering, and semantic abstraction.

III. Methodology

In this section, we present the three key building blocks of our framework. First, we introduce a fully differentiable rasterization pipeline in which each Bézier curve is modeled as a sequence of spherical Gaussians sampled along its path and projected into 2D via camera intrinsics and extrinsics.

To steer the 3D Bézier sketch toward both semantic fidelity and geometric precision, we use a CLIP-based loss with two components: a high-level semantic similarity term that aligns rendered views in CLIP's embedding space, and a geometric alignment term that preserves local, pixel-level structure.

Finally, we describe the model training, which alternates between projecting the current 3D curve set into 2D, computing the CLIP loss against ground-truth images for the current view, and backpropagating through the differentiable renderer to refine the Bézier control points until convergence.

III-A. 3D Differentiable Rasterization with Spherical Gaussians

To represent 3D sketches in a differentiable manner, we model 3D Bézier curves using spherical Gaussians (SGs) sampled along each curve’s path. This design allows efficient rasterization of the sketches into 2D images for training. Unlike CLIPasso’s discrete 2D rasterization, our approach supports backpropagation in 3D space and enables full differentiability with respect to the control points. Inspired by recent advances in Gaussian Splatting, we represent Bézier curves as sequences of Gaussians, each defined by a center and a fixed thickness, that can be projected onto 2D views using differentiable camera models.

3D Bézier curves are used for representing 3D sketches of objects.

Each spherical Gaussian is defined as:

$$G(\mathbf{x}) = \exp\!\left(-\frac{\lVert \mathbf{x} - \boldsymbol{\mu} \rVert^{2}}{2\sigma^{2}}\right),$$

where $\boldsymbol{\mu}$ is the center of the Gaussian and $\sigma$ is its radius (or thickness). To sample a Bézier curve of length $L$ with a desired overlap ratio $o$, the sampling step $\Delta s$ and the number of samples $N$ are computed as:

$$\Delta s = 2\sigma\,(1 - o), \qquad N = \left\lceil \frac{L}{\Delta s} \right\rceil.$$

For a Bézier curve with control points $\mathbf{P}_0, \dots, \mathbf{P}_n$, the centers of the Gaussians sampled along the curve are given by:

$$\boldsymbol{\mu}(t_i) = \sum_{k=0}^{n} B_k^{n}(t_i)\, \mathbf{P}_k, \qquad t_i = \frac{i}{N-1}, \quad i = 0, \dots, N-1,$$

where $B_k^{n}(t) = \binom{n}{k}\, t^{k}(1-t)^{n-k}$ are the Bernstein basis coefficients. This makes each Gaussian center fully differentiable with respect to the Bézier control points.

Spherical Gaussians are utilized for differentiable rasterization of the 3D sketches.
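To make the sampling step concrete, the following PyTorch sketch evaluates the Bernstein basis and produces differentiable Gaussian centers along a single cubic curve, under the assumptions above (centers spaced $2\sigma(1-o)$ apart along the curve). Function names such as `sample_gaussian_centers` are illustrative and not taken from our released code.

```python
import math
import torch

def bernstein_basis(t: torch.Tensor, degree: int = 3) -> torch.Tensor:
    """Bernstein basis B_k^n(t) for k = 0..degree; returns shape (len(t), degree + 1)."""
    k = torch.arange(degree + 1, dtype=t.dtype, device=t.device)
    binom = torch.tensor([math.comb(degree, i) for i in range(degree + 1)],
                         dtype=t.dtype, device=t.device)
    t = t.unsqueeze(-1)                               # (N, 1)
    return binom * t**k * (1.0 - t)**(degree - k)     # (N, degree + 1)

def sample_gaussian_centers(control_points: torch.Tensor,
                            sigma: float,
                            overlap: float,
                            curve_length: float) -> torch.Tensor:
    """Sample spherical-Gaussian centers along one cubic Bézier curve.

    control_points: (4, 3) tensor of learnable control points.
    sigma:          Gaussian radius (stroke thickness).
    overlap:        desired overlap ratio between consecutive Gaussians.
    curve_length:   approximate arc length of the curve.
    """
    step = 2.0 * sigma * (1.0 - overlap)              # spacing between centers
    n_samples = max(2, math.ceil(curve_length / step))
    t = torch.linspace(0.0, 1.0, n_samples,
                       dtype=control_points.dtype, device=control_points.device)
    basis = bernstein_basis(t, degree=3)              # (N, 4)
    # Centers are linear in the control points, so gradients flow back to them.
    return basis @ control_points                     # (N, 3)

# Example: one randomly initialised curve with learnable control points.
P = torch.randn(4, 3, requires_grad=True)
centers = sample_gaussian_centers(P, sigma=0.02, overlap=0.5, curve_length=1.0)
```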

After generating the Gaussians, we render them through a differentiable rasterizer that projects the 3D Gaussians into 2D using camera intrinsics and extrinsics [6]. The resulting image is given by:

$$I = \mathcal{R}\big(\{G_i\}_{i=1}^{N};\, K, E\big),$$

where $K$ is the intrinsic matrix encoding camera parameters such as focal length and principal point:

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},$$

and $E$ is the extrinsic matrix representing the camera pose in the world frame, given by:

$$E = \begin{bmatrix} R & \mathbf{t} \\ \mathbf{0}^{\top} & 1 \end{bmatrix},$$

where $R$ is the rotation matrix and $\mathbf{t}$ is the translation vector. Together, these matrices map a 3D point $\mathbf{X}$ from world coordinates to 2D pixel coordinates $\mathbf{u} \sim K\,[R \mid \mathbf{t}]\,\tilde{\mathbf{X}}$ in the image plane.

Each pixel corresponds to a projection and accumulation over 3D Gaussians, which are parameterized by their mean positions $\boldsymbol{\mu}_i$, covariance matrices $\Sigma_i$, and opacity values $\alpha_i$:

$$C(\mathbf{u}) = \sum_{i} c_i\, \alpha_i\, G_i(\mathbf{u}) \prod_{j < i} \big(1 - \alpha_j\, G_j(\mathbf{u})\big).$$

Notably, the strokes used in our sketch abstraction are represented as Bézier curves with 4 control points $\mathbf{P}_0, \mathbf{P}_1, \mathbf{P}_2, \mathbf{P}_3$, forming a cubic Bézier formulation:

$$\mathbf{B}(t) = (1-t)^3\,\mathbf{P}_0 + 3(1-t)^2 t\,\mathbf{P}_1 + 3(1-t)\,t^2\,\mathbf{P}_2 + t^3\,\mathbf{P}_3, \qquad t \in [0, 1].$$

This parametric formulation enables fine control over stroke shape and spatial distribution in 3D. The end-to-end differentiability of the pipeline allows training the sketch representation directly using image-level loss functions such as CLIP similarity.
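As an illustration of the projection step, the sketch below applies a pinhole model to the Gaussian centers and accumulates isotropic 2D Gaussians into an image with a naive dense loop. A practical implementation would use the tile-based rasterizer of [3]; the max-based accumulation is a simplification of the alpha compositing above for matte, single-color strokes, and all names are illustrative.

```python
import torch

def project_points(centers: torch.Tensor, K: torch.Tensor,
                   R: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Project 3D Gaussian centers (N, 3) to pixel coordinates (N, 2)."""
    cam = centers @ R.T + t            # world -> camera (assuming [R | t] is world-to-camera)
    uv = cam @ K.T                     # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

def splat_image(centers: torch.Tensor, K: torch.Tensor, R: torch.Tensor,
                t: torch.Tensor, sigma_px: float, H: int, W: int) -> torch.Tensor:
    """Naive dense splatting: accumulate isotropic 2D Gaussians into an (H, W) image."""
    uv = project_points(centers, K, R, t)                    # (N, 2)
    ys, xs = torch.meshgrid(torch.arange(H, dtype=uv.dtype),
                            torch.arange(W, dtype=uv.dtype), indexing="ij")
    pix = torch.stack([xs, ys], dim=-1).reshape(-1, 2)       # (H*W, 2) pixel grid
    d2 = ((pix[:, None, :] - uv[None, :, :]) ** 2).sum(-1)   # squared pixel distances (H*W, N)
    weights = torch.exp(-d2 / (2.0 * sigma_px ** 2))
    # Max over Gaussians approximates opaque, single-color strokes.
    return weights.max(dim=1).values.reshape(H, W)
```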

III-B. CLIP-based Objective

To guide the optimization of the 3D Bézier sketch toward both geometrically accurate and semantically meaningful representations, we leverage a CLIP-based loss function. This loss encourages the rendered sketch views to match the real images not just pixel-wise but in high-level perceptual and semantic space. Following the strategy of CLIPasso, we define a dual-component objective that combines semantic similarity and geometric alignment, each captured from different layers of a pretrained CLIP model.

The final loss is a weighted sum of the semantic and geometric components:

$$\mathcal{L} = \lambda_{\text{geom}}\, \mathcal{L}_{\text{geom}} + \lambda_{\text{sem}}\, \mathcal{L}_{\text{sem}}.$$

Here, $\lambda_{\text{geom}}$ and $\lambda_{\text{sem}}$ are hyperparameters that balance the contribution of each term. The semantic loss $\mathcal{L}_{\text{sem}}$ is computed using the output of CLIP's final fully connected layer, which captures the global alignment between rendered sketches and their corresponding RGB images in CLIP's joint vision-language embedding space.

The geometric loss captures local spatial structure and can be extracted either from CLIP’s early convolutional layers or from token-wise attention maps, depending on the selected variant. This term ensures that the layout and fine-scale details of the projected sketches align with real-world object contours.

During training, the 3D sketch is projected into 2D using the differentiable rasterizer described in Section III-A. The resulting image is then passed through CLIP alongside its target RGB counterpart. The cosine distance between their embeddings drives the optimization, gradually refining the control points of the Bézier curves so that the rendered sketches become increasingly perceptually and semantically aligned with real views.
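A minimal sketch of the composite objective is given below, assuming the OpenAI `clip` package with a ViT-B/32 backbone and a hook-based `feature_extractor` helper (not shown) that returns intermediate activations; image preprocessing and normalization are omitted for brevity, and the weight values are placeholders.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package; ViT-B/32 backbone as in CLIPasso

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device, jit=False)
model.eval()
for p in model.parameters():          # CLIP stays frozen; only the curves are optimized
    p.requires_grad_(False)

def semantic_loss(sketch: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cosine distance between final CLIP embeddings of the rendered sketch and
    the target RGB view (both preprocessed to (B, 3, 224, 224))."""
    zs = model.encode_image(sketch)
    zt = model.encode_image(target)
    return (1.0 - F.cosine_similarity(zs, zt, dim=-1)).mean()

def geometric_loss(sketch: torch.Tensor, target: torch.Tensor, feature_extractor) -> torch.Tensor:
    """L2 distance between intermediate CLIP activations. `feature_extractor` is an
    assumed hook-based helper returning a list of feature maps per image."""
    fs, ft = feature_extractor(sketch), feature_extractor(target)
    return sum(F.mse_loss(a, b) for a, b in zip(fs, ft))

def total_loss(sketch, target, feature_extractor, w_geom=1.0, w_sem=0.1):
    """Weighted combination of the geometric and semantic terms."""
    return (w_geom * geometric_loss(sketch, target, feature_extractor)
            + w_sem * semantic_loss(sketch, target))
```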

III-C. Model Training

The training phase aims to optimize a set of 3D Bézier curves such that their 2D projections resemble real object images when viewed from different camera angles. This is achieved through an iterative loop involving projection, perceptual comparison, and gradient-based refinement. By leveraging CLIP embeddings as a supervisory signal, the model learns to align the rendered sketch views with the semantics of real RGB views. The training continues until convergence, producing a geometry-aware sketch representation that is both visually and semantically faithful.

Rasterization of images from the constructed 3D sketch using the camera matrices.

In each training iteration, the current 3D Bézier sketch is projected into 2D using the known camera intrinsics and extrinsics. This rasterization step generates a view-dependent binary sketch image that mimics how the 3D curves would appear from that specific viewpoint.

Obtaining the CLIP loss between the RGB images of the views and the sampled sketch views.

The projected sketch view is then compared to the corresponding RGB image of the same view. Both are encoded using CLIP’s vision-language model, and a cosine similarity loss is computed to quantify their perceptual alignment in the embedding space.

Optimizing the control points using the backpropagated loss.

The CLIP loss is backpropagated through the differentiable projection and rendering pipeline to adjust the 3D control points of the Bézier curves. This step ensures that the curves evolve to better match the visual content of the reference images.

Training the model until the loss converges.

This process is repeated iteratively for all views in the dataset. Over time, the loss steadily decreases as the curves converge to a semantically accurate and geometrically coherent representation of the object across all viewpoints.
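The overall loop can be summarized by the hedged sketch below, where `render_view` stands for the differentiable rasterizer of Section III-A, `total_loss` and `feature_extractor` for the CLIP objective of Section III-B, and `views` for an iterable of camera parameters and target RGB images; these names and constants are placeholders rather than our exact implementation.

```python
import torch

# Illustrative constants; `views`, `render_view`, `total_loss`, and
# `feature_extractor` are placeholders for the components described above.
n_curves, num_epochs = 20, 100
control_points = torch.randn(n_curves, 4, 3, requires_grad=True)  # 4 control points per curve
optimizer = torch.optim.Adam([control_points], lr=1e-2)

for epoch in range(num_epochs):
    for K, R, t, target_rgb in views:                          # one camera/view per item
        optimizer.zero_grad()
        sketch = render_view(control_points, K, R, t)          # differentiable rasterization (Sec. III-A)
        loss = total_loss(sketch, target_rgb, feature_extractor)  # CLIP objective (Sec. III-B)
        loss.backward()                                        # gradients flow to the control points
        optimizer.step()
```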

IV. Results

In this section, we present the results of our 3D sketch abstraction pipeline. We begin by introducing the custom multiview datasets we created, then illustrate the training process using train/test splits to evaluate spatial consistency throughout optimization. Finally, we report the outcomes of a series of experiments that vary key hyperparameters.


IV-A. Custom Blender Datasets

Classical NeRF datasets emphasize high-resolution detail and sharp features to benchmark rendering accuracy. Such geometric complexity can impede our abstraction-focused optimization. We therefore generated our own multiview datasets in Blender, using semantically and visually simple objects. To assess the performance of our method, we rendered both training and testing views for each object.
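For illustration, a multiview dataset of this kind can be produced with a short Blender (bpy) script that orbits the scene camera around the object and renders each view; the view count, orbit radius, and output paths below are illustrative rather than the exact values used for our datasets.

```python
import math
import bpy
from mathutils import Vector

# Orbit the scene camera around the origin and render N training views.
scene = bpy.context.scene
cam = scene.camera                    # assumes the scene already contains a camera
n_views, radius, height = 20, 4.0, 1.5

for i in range(n_views):
    angle = 2.0 * math.pi * i / n_views
    cam.location = Vector((radius * math.cos(angle), radius * math.sin(angle), height))
    # Aim the camera at the object placed at the world origin.
    direction = Vector((0.0, 0.0, 0.0)) - cam.location
    cam.rotation_euler = direction.to_track_quat('-Z', 'Y').to_euler()
    scene.render.filepath = f"//renders/train_{i:03d}.png"   # path relative to the .blend file
    bpy.ops.render.render(write_still=True)
```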

IV-B. Training - Train & Test Splits

To verify that our 3D sketch abstractions remain spatially consistent from viewpoints not used during optimization, we split each multiview dataset into training and test sets. During training, curve parameters are updated using the CLIP-based loss computed on the training views. Simultaneously, at each iteration, we render the current sketch from unseen test viewpoints and compute the CLIP loss against the corresponding test images. Below, we present the evolution of both training and test losses throughout the optimization process, along with visual snapshots illustrating the evolution of the 3D curves at different stages.

Loss values on the train and test splits for each object over the training iterations.

IV-C. Experiments with Hyperparameters

To evaluate the impact of key hyperparameters on both the convergence behavior and the visual fidelity of our 3D sketch abstractions, we performed a series of controlled experiments. In each case, a single parameter was varied while all others were held constant. Below, we present qualitative results illustrating how these parameters influence the final renderings. The full set of hyperparameters exposed through the argument parser is listed below.

Argument Description
--batch_size Number of images processed per training batch.
--epochs Number of training epochs.
--inner_steps Inner optimization steps per batch.
--learning_rate Learning rate for Adam optimizer.
--n_curves Number of Bézier curves initialized in the scene.
--thickness Radius (thickness) of each rendered curve/sphere segment.
--radius Distance from the scene center to place initial curve points.
--length Initial length of the randomly created curve segments.
--overlap Degree of allowed overlap between nearby spherical Gaussians on the same curve.
--clip_weight Total CLIP loss weight.
--clip_conv_loss Weight of the convolutional CLIP loss.
--clip_fc_loss_weight Weight of CLIP’s final-layer (semantic) similarity loss.
--clip_conv_layer_weights Weights for each CLIP convolutional layer (ViT-B/32 has 5 layers).
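A minimal `argparse` setup exposing these flags might look as follows; the default values shown are illustrative placeholders rather than the settings used in our experiments.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Defaults below are illustrative, not the values used in our experiments.
    p = argparse.ArgumentParser(description="3D sketch abstraction via constrained Gaussian splatting")
    p.add_argument("--batch_size", type=int, default=2)
    p.add_argument("--epochs", type=int, default=100)
    p.add_argument("--inner_steps", type=int, default=1)
    p.add_argument("--learning_rate", type=float, default=1e-2)
    p.add_argument("--n_curves", type=int, default=20)
    p.add_argument("--thickness", type=float, default=0.02)
    p.add_argument("--radius", type=float, default=1.0)
    p.add_argument("--length", type=float, default=0.5)
    p.add_argument("--overlap", type=float, default=0.5)
    p.add_argument("--clip_weight", type=float, default=1.0)
    p.add_argument("--clip_conv_loss", type=float, default=1.0)
    p.add_argument("--clip_fc_loss_weight", type=float, default=0.1)
    p.add_argument("--clip_conv_layer_weights", type=float, nargs=5,
                   default=[0.0, 0.0, 1.0, 1.0, 0.0])
    return p

args = build_parser().parse_args()
```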

Number of curves

This hyperparameter controls how many 3D Bézier curves are used to represent the sketch. A higher number of curves allows for more detailed reconstruction, while fewer curves enforce abstraction. We experimented with 10, 15, 20, and 25 curves. Results are shown below for both the rose and duck objects.

Rose: ground truth and reconstructions with 10, 15, 20, and 25 curves.
Duck: ground truth and reconstructions with 10, 15, 20, and 25 curves.

Batch Size

The batch size controls how many camera views are processed jointly during each optimization step. A larger batch size enables more consistent updates across viewpoints, potentially improving multiview coherence. We tested batch sizes of 1, 2, 3, and 4 on the duck object.

Duck: ground truth and reconstructions with batch sizes of 1, 2, 3, and 4.

Semantic Loss Weight

This weight controls the influence of high-level semantics (CLIP FC loss) during optimization. A low value prioritizes geometric alignment, while a higher value emphasizes perceptual similarity with the reference image. We experimented with weights of 0.01 and 0.5 using the rose object.

Rose: ground truth and reconstructions with semantic loss weights of 0.01 and 0.5.

Curve Thickness

The stroke thickness affects how bold each curve appears in the sketch. While it does not impact optimization, it influences visual clarity and stylistic choices. We experimented with thickness values of 0.02 and 0.03.

Rose: ground truth and reconstructions with curve thicknesses of 0.02 and 0.03.

IV-D. Interactive Visualisation

An interactive rendering of the final 3D sketch of the rose object allows the stroke configuration to be rotated, zoomed, and inspected from different viewpoints.

V. Conclusion and Limitations

Our work explores the use of CLIP-based objectives for 3D sketch abstraction via differentiable Bézier curves. Through our implementation and experimentation, we make the following observations:

  • Depth ambiguity in single-view optimization leads to visually plausible but geometrically inconsistent 3D reconstructions.
  • Multi-view supervision stabilizes depth inference and leads to spatially coherent sketches across viewpoints.
  • Differentiable Gaussian rasterization enables gradient-based optimization over curve control points, but memory efficiency is critical to scale.
  • The 3D sketch generation task can be reduced to a Gaussian splatting problem in which the opacity, color, and variance parameters are held fixed and the splat means are constrained to lie along Bézier curves defined by their control points.
These findings highlight the importance of viewpoint diversity and geometry-aware initialization in abstract 3D reconstruction tasks, and support the use of CLIP-based semantic loss as a viable surrogate when explicit 3D supervision is unavailable.

V-A. Limitations and Future Work

Although we have obtained promising results, there are several limitations in our work that could be addressed in future research:

Multi-view consistency remains one of the most critical and challenging objectives in 3D image generation and editing. In our work, despite employing batches of multi-view images and experimenting with various hyperparameter settings, we observed failures in achieving semantic consistency across views for certain objects. Notably, we found that the CLIP-based loss occasionally overemphasizes semantic similarity at the expense of geometric coherence. For example, in the case of the horse model, the object may resemble a full horse from one viewpoint but appear as only a horse's head from another. A similar inconsistency is evident in the GIF below, where the bicycle sketch exhibits view-dependent deformation apparent from the tires. Addressing this issue requires further refinement of the training strategy, particularly in the initialization and optimization of Bézier control points. Additionally, more careful tuning of the CLIP loss weighting parameters may help balance semantic alignment with geometric structure across views.

The bicycle reconstruction exhibits view-dependent deformation apparent from the tires.

Another challenge encountered during training is the tendency of the optimization process to overfill the volume of the object when the number of Bézier curves becomes too high. As the sketch density increases, the CLIP-based loss drives the model to generate a solid, filled-in representation rather than preserving the sparse, abstract nature of a line-based sketch. This behavior may be partially related to the previously discussed issue, where the CLIP loss struggles to maintain a proper balance between semantic and geometric alignment across views. However, it also appears to be influenced by the number of sketch curves: as more curves are introduced, the optimization increasingly prioritizes semantic coverage, effectively “filling in” the shape. This effect is illustrated in the duck reconstruction GIF below, where the sketch—composed of 40 curves—visibly fills the object’s interior volume rather than outlining its structural features.

The duck reconstruction exhibits volume-filling when the number of curves used to model the sketch is increased.

To address this problem, additional geometric constraints on curve proximity could be introduced to discourage excessive overlap. Alternatively, further improvements to the semantic-geometric balance described earlier may also alleviate this behavior. Exploring dynamic loss weighting strategies or incorporating structural priors remains a promising direction for future work.
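One possible form of such a constraint is a hinge-style repulsion penalty between Gaussian centers belonging to different curves, sketched below. The margin value and the weighting of this term relative to the CLIP loss would need to be tuned; this is an illustration rather than part of our current pipeline.

```python
import torch

def proximity_penalty(centers: torch.Tensor, margin: float) -> torch.Tensor:
    """Hinge penalty on pairs of Gaussian centers from *different* curves that
    fall closer than `margin`. centers: (C, N, 3), one row of samples per curve."""
    C, N, _ = centers.shape
    flat = centers.reshape(C * N, 3)
    dists = torch.cdist(flat, flat)                          # pairwise distances (CN, CN)
    curve_id = torch.arange(C, device=centers.device).repeat_interleave(N)
    same_curve = curve_id[:, None] == curve_id[None, :]      # mask intra-curve pairs
    violation = torch.clamp(margin - dists, min=0.0).pow(2)  # zero once pairs are far enough apart
    return violation.masked_fill(same_curve, 0.0).mean()
```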

The initialization strategy for Bézier curves plays a critical role in achieving high-quality sketch reconstruction. In CLIPasso, curve initialization is guided by thresholding saliency maps produced by the CLIP model, enabling semantically meaningful placements in 2D. However, this approach does not generalize well to 3D, where surface geometry must be taken into account. To keep the scope of the project focused and tractable, we chose not to train a NeRF or Gaussian Splatting model to obtain 3D saliency maps for curve placement. Instead, we adopted a random initialization strategy, which, despite its simplicity, yielded reasonably good results. Nevertheless, we believe that the visual quality and consistency of the generated sketches could be significantly improved with more informed initialization schemes—particularly those that incorporate geometric priors or 3D-aware saliency cues.

In this project, we developed a novel sketch generation technique built upon the principles of Gaussian splatting, enhanced with task-specific constraints. While the visual quality of our results may fall short compared to state-of-the-art methods in the literature, the framework and research introduced here offer strong potential for reuse and extension. In particular, our approach provides a foundation for future academic work focused on task-oriented, constrained 3D applications—an area that remains largely underexplored and rich with opportunity.


References

  1. Y. Vinker, E. Pajouheshgar, J. Y. Bo, R. C. Bachmann, A. H. Bermano, D. Cohen-Or, A. Zamir, and A. Shamir, “CLIPasso: Semantically-Aware Object Sketching,” ACM Transactions on Graphics, vol. 41, no. 4, art. no. 86, pp. 1–11, Jul. 2022, https://doi.org/10.1145/3528223.3530068.
  2. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning Transferable Visual Models From Natural Language Supervision,” in Proc. 38th Int. Conf. Machine Learning (ICML), vol. 139, pp. 8748–8763, 2021, https://doi.org/10.48550/arXiv.2103.00020.
  3. B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3D Gaussian Splatting for Real-Time Radiance Field Rendering,” ACM Transactions on Graphics, vol. 42, no. 4, art. no. 139, pp. 1–14, Jul. 2023, https://doi.org/10.1145/3592433.
  4. C. Choi, J. Lee, J. Park, and Y. M. Kim, “3Doodle: Compact Abstraction of Objects with 3D Strokes,” ACM Transactions on Graphics, vol. 43, no. 4, art. no. 107, pp. 1–13, Jul. 2024, https://doi.org/10.1145/3658156.
  5. Y. Zhang, L. Wang, C. Zou, T. Wu, and R. Ma, “Diff3DS: Generating View-Consistent 3D Sketch via Differentiable Curve Rendering,” in Proc. 13th Int. Conf. Learning Representations (ICLR), 2025, https://doi.org/10.48550/arXiv.2405.15305.
  6. R. Szeliski, Computer Vision: Algorithms and Applications, 2nd ed. Cham, Switzerland: Springer, 2022. [Online]. Available: https://doi.org/10.1007/978-3-030-34372-9.