Abstract

We present MuVieCAST, a modular multi-view consistent style transfer network architecture that enables consistent style transfer between multiple viewpoints of the same scene. The architecture supports both sparse and dense views, making it versatile enough to handle a wide range of multi-view image datasets. The approach consists of three modules that perform specific tasks: content preservation, image transformation, and multi-view consistency enforcement. We evaluate our approach extensively across multiple application domains, including depth-map-based point cloud fusion, mesh reconstruction, and novel-view synthesis. The results demonstrate that the framework produces high-quality stylized images while maintaining consistency across multiple views, even for complex styles that involve mosaic tessellations or extensive brush strokes. Our modular framework is extensible and can easily be integrated with various backbone architectures, making it a flexible solution for multi-view style transfer.
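
For readers who want a concrete picture of how the three modules interact, below is a minimal PyTorch-style sketch. It is an illustration only: the class name, arguments, and exact loss terms are assumptions rather than the released MuVieCAST implementation (the style loss, loss weights, and concrete backbones are omitted).

import torch
import torch.nn as nn
import torch.nn.functional as F

class MuVieCASTSketch(nn.Module):
    """Conceptual composition of the three modules; all names are placeholders."""

    def __init__(self, content_encoder, stylizer, mvs_backbone):
        super().__init__()
        self.content_encoder = content_encoder  # content preservation (e.g., a frozen VGG encoder)
        self.stylizer = stylizer                # image transformation (e.g., a UNet or AdaIN decoder)
        self.mvs_backbone = mvs_backbone        # multi-view consistency (e.g., a frozen MVS network)

    def forward(self, views, style_image, cameras):
        # Stylize each input view with the image-transformation module.
        stylized = [self.stylizer(v, style_image) for v in views]

        # Content term: deep features of each stylized view should stay close
        # to those of the corresponding original view.
        content_loss = sum(
            F.mse_loss(self.content_encoder(s), self.content_encoder(v))
            for s, v in zip(stylized, views)
        )

        # Consistency term: geometry predicted by the MVS backbone from the
        # stylized views should match the geometry predicted from the original
        # views, tying the stylization of different viewpoints together.
        with torch.no_grad():
            reference_depth = self.mvs_backbone(views, cameras)
        stylized_depth = self.mvs_backbone(stylized, cameras)
        consistency_loss = F.l1_loss(stylized_depth, reference_depth)

        return stylized, content_loss, consistency_loss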

Paper Video

Results

Point cloud reconstruction

Texturing quality

The bird point cloud reconstructed using depth-map fusion from 49 input images and a style image. This experiment demonstrates the ability of our stylization network (CasMVSNet_UNet) to reveal finer details in textured areas, even in shadowed regions.
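
For context, depth-map fusion lifts each per-view depth map into world space using the camera parameters and merges the resulting colored points into a single cloud. The NumPy sketch below shows the per-view back-projection step under the usual assumptions (pinhole intrinsics K, a camera-to-world transform world_from_cam); the cross-view consistency filtering used in real fusion pipelines is omitted, and the function name is hypothetical.

import numpy as np

def backproject_depth(depth, colors, K, world_from_cam):
    """Lift one depth map and its per-pixel colors into a colored 3D point set.

    depth          : (H, W) metric depth per pixel (0 marks invalid pixels)
    colors         : (H, W, 3) colors, e.g. from the stylized image
    K              : (3, 3) pinhole intrinsics
    world_from_cam : (4, 4) camera-to-world transform
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0

    # Pixel coordinates -> 3D points in the camera frame.
    pixels = np.stack([u[valid], v[valid], np.ones(valid.sum())])  # (3, N)
    cam_points = np.linalg.inv(K) @ pixels * depth[valid]          # (3, N)

    # Camera frame -> world frame.
    cam_h = np.vstack([cam_points, np.ones((1, cam_points.shape[1]))])
    world_points = (world_from_cam @ cam_h)[:3].T                  # (N, 3)

    return world_points, colors[valid]

# Fusing a scene then amounts to concatenating the per-view point sets
# (real pipelines additionally filter points by cross-view consistency).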

Reconstruction quality

Comparison of the point clouds reconstructed with CasMVSNet_UNet from the original input images (left column) and from the stylized images (right column). Top: colored with the original inputs. Middle: colored with the stylized colors. Bottom: uniform coloring.

Comparing stylization backbones

Point cloud reconstruction results using the PatchmatchNet backbone (Geometry + Coloring). Point clouds (a), (b), and (c) are identical in geometry: (a) is colored by the 64 original input images, (b) by the stylized images produced with PatchmatchNet UNet, and (c) by the stylized images produced with PatchmatchNet AdaIN. Point clouds (d) and (e) are reconstructed from the stylized images only.

Neural rendering and mesh reconstruction experiments

Neural mesh reconstruction results. First row: CasMVSNet UNet. Second row: CasMVSNet AdaIN. Third row: PatchmatchNet UNet. Fourth row: PatchmatchNet AdaIN. The columns are as follows: (a) input image; (b) stylized image; (c) stylized mesh surface rendering learned by IDR; (d) mesh reconstructed from the stylized images; (e) mesh reconstructed from the original input images.

Mesh editing

Mesh editing effect. The reconstructed mesh demonstrates stripe-like geometric features.

Mesh coloring

Mesh surface coloring (Geometry + Coloring). Meshes (a) and (b) have the same geometric properties but differ in their coloring. Specifically, (a) and (c) are colored using 64 original input images, while (b) and (d) are colored using the stylized images. The geometry of (a) and (b) is derived from the original inputs, while the geometry of (c) and (d) is learned from the stylized images.



Novel view synthesis

NeRF results




Comparison and User Survey

To evaluate the effectiveness of our approach against the recent state-of-the-art method Artistic Radiance Fields (ARF) [Zhang et al., ECCV 2022], we conducted an anonymous user survey with 40 participants. During the survey, participants were provided with sample scene images and style images. The evaluation was performed on five scenes from the Tanks and Temples dataset, each paired with a distinct style image. For each scene, participants were presented with two videos in random order: one generated with ARF, using ground-truth pose information and the recommended parameters, and one generated with our method. To ensure a fair assessment, we adhered closely to the default trajectory employed by ARF. In all NeRF experiments with our method, we went beyond computing radiance fields and also estimated the camera calibrations directly from the stylized images using COLMAP SfM.
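
As a rough illustration of this pose-estimation step, the snippet below runs a standard COLMAP sparse-reconstruction pipeline (feature extraction, exhaustive matching, incremental mapping) on a folder of stylized images via the COLMAP command-line interface. The paths and helper name are placeholders, and the actual experiments may have used different settings.

import subprocess
from pathlib import Path

def estimate_poses_from_stylized(image_dir: str, workspace: str) -> None:
    """Run COLMAP feature extraction, matching, and mapping on stylized images.

    This is a generic sparse-SfM invocation, not the exact configuration used
    in our experiments; image_dir should contain the stylized views.
    """
    ws = Path(workspace)
    ws.mkdir(parents=True, exist_ok=True)
    database = ws / "database.db"
    sparse = ws / "sparse"
    sparse.mkdir(exist_ok=True)

    # 1. Detect and describe SIFT features in every stylized image.
    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", str(database),
                    "--image_path", image_dir], check=True)

    # 2. Match features exhaustively across all image pairs.
    subprocess.run(["colmap", "exhaustive_matcher",
                    "--database_path", str(database)], check=True)

    # 3. Incremental mapping: recover camera intrinsics, poses, and a sparse cloud.
    subprocess.run(["colmap", "mapper",
                    "--database_path", str(database),
                    "--image_path", image_dir,
                    "--output_path", str(sparse)], check=True)

# Example usage (placeholder paths):
estimate_poses_from_stylized("stylized_images/", "colmap_workspace/")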


ARF

The ARF videos were rendered using ground-truth pose information and the parameters and trajectory recommended by the authors. The code and trajectories for ARF are publicly available at this link.

Ours

To demonstrate the robustness of our multi-view consistency, we used pose information estimated from the stylized images. To ensure a fair assessment, we made every effort to closely follow the default trajectory used by ARF.

Comparison of novel-view synthesis with ARF on five "Tanks and Temples" scenes in five different styles. On the left, sample scene images and style images are displayed. In the middle, the top row with a blue background shows our results, while the bottom row shows the results of ARF. On the right, the user study results are presented, based on feedback from 40 participants. In 68% of the cases, participants preferred our results over those generated by ARF. Below are the NeRF videos used in the user study; their order was randomized during the survey.

BibTeX

@InProceedings{ibrahimli2024muviecast,
  author    = {Nail Ibrahimli and Julian F. P. Kooij and Liangliang Nan},
  title     = {MuVieCAST: Multi-View Consistent Artistic Style Transfer},
  booktitle = {3DV},
  year      = {2024},
}