Organising Large Projects with Sub-Pipelines

This vignette introduces rxp_pipeline(), a function for organising large projects into logical sub-pipelines. This feature is particularly useful when working on complex projects with multiple phases (e.g., ETL, Modelling, Reporting) or when collaborating in teams where different members work on different parts of the pipeline.

Large Pipelines Become Unwieldy

As pipelines grow, a single gen-pipeline.R file can become difficult to manage. Consider a data science project with:

- Data extraction and cleaning (ETL)
- Feature engineering
- Model training
- Model evaluation
- Report generation

Putting all derivations in one file makes it hard to:

Navigate the code
Understand which derivations belong to which phase
Collaborate across team members
Reuse pipeline components in other projects

To solve this issue, you can define each sub-pipeline with rxp_pipeline() and combine them in a master script that passes them to rxp_populate().

This allows you to:

Organise derivations into named groups
Colour-code groups for visual distinction in DAG visualisations
Modularise your code across multiple R scripts

Basic Usage

A project with sub-pipelines would look something like this:

my-project/
├── default.nix           # Nix environment (generated by rix)
├── gen-env.R             # Script to generate default.nix
├── gen-pipeline.R        # MASTER SCRIPT: combines all sub-pipelines
└── pipelines/
    ├── 01_data_prep.R    # Data preparation sub-pipeline
    ├── 02_analysis.R     # Analysis sub-pipeline
    └── 03_reporting.R    # Reporting sub-pipeline

Each sub-pipeline file returns a list of derivations:

# Data Preparation Sub-Pipeline
# pipelines/01_data_prep.R
library(rixpress)

list(
  rxp_r(name = raw_mtcars, expr = mtcars),
  rxp_r(name = clean_mtcars, expr = dplyr::filter(raw_mtcars, am == 1)),
  rxp_r(name = selected_mtcars, expr = dplyr::select(clean_mtcars, mpg, cyl, hp, wt))
)
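In plain R, a script "returns" the value of its last top-level expression when it is read with source(). The sketch below illustrates that mechanism with a temporary file and base R only. Whether rxp_pipeline() reads sub-pipeline files in exactly this way is an assumption, and the file contents are plain-list stand-ins rather than real derivations.

```r
# Write a stand-in sub-pipeline file whose last expression is a list.
# (Plain lists are used instead of rxp_r() calls so the sketch runs
# without rixpress installed.)
script <- tempfile(fileext = ".R")
writeLines(
  c(
    "list(",
    "  list(name = 'raw_mtcars'),",
    "  list(name = 'clean_mtcars')",
    ")"
  ),
  script
)

# source() exposes the value of the last evaluated expression as $value
result <- source(script, local = TRUE)$value

length(result)
vapply(result, function(d) d$name, character(1))
```

Under this assumption, the list(...) call should stay the last expression in each sub-pipeline file; a statement placed after it would become the file's value instead.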

The rxp_pipeline() function takes:

name: A descriptive name for this group of derivations
path: Either a file path to an R script returning a list of derivations (recommended), or a list of derivation objects.
color: Optional CSS color name or hex code for DAG visualisation
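As a brief illustration of the second form of path, the sketch below passes a list of derivations directly instead of a file path. The pipeline name and the derivation mtcars_head are made up for illustration; the three arguments are the ones described above.

```r
library(rixpress)

# Pass derivations inline rather than pointing to a script.
# "Inline Example" and mtcars_head are illustrative names only.
pipe_inline <- rxp_pipeline(
  name = "Inline Example",
  path = list(
    rxp_r(name = raw_mtcars, expr = mtcars),
    rxp_r(name = mtcars_head, expr = head(raw_mtcars))
  ),
  color = "darkorange" # a CSS colour name; a hex code is the alternative
)
```

The file-based form remains the recommended one; the inline form can be handy for small experiments.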

The second sub-pipeline follows the same pattern and refers to a derivation defined in the first one (selected_mtcars):

# Analysis Sub-Pipeline
# pipelines/02_analysis.R
library(rixpress)

list(
  rxp_r(name = summary_stats, expr = summary(selected_mtcars)),
  rxp_r(name = mpg_model, expr = lm(mpg ~ hp + wt, data = selected_mtcars)),
  rxp_r(name = model_coefs, expr = coef(mpg_model))
)

The master script stays short, because rxp_pipeline() handles sourcing the files:

# gen-pipeline.R
library(rixpress)

# Create named pipelines with colours by pointing to the files
pipe_data_prep <- rxp_pipeline(
  name = "Data Preparation",
  path = "pipelines/01_data_prep.R",
  color = "#E69F00"
)

pipe_analysis <- rxp_pipeline(
  name = "Statistical Analysis",
  path = "pipelines/02_analysis.R",
  color = "#56B4E9"
)

# Build combined pipeline
rxp_populate(list(pipe_data_prep, pipe_analysis), project_path = ".", build = TRUE)
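If the number of sub-pipelines grows, the list passed to rxp_populate() could be assembled from the files in pipelines/ rather than written by hand. The following is a sketch under stated assumptions: the helper pipeline_specs() and its rule for deriving a name from a file name are hypothetical, and colours are simply taken in order from a vector you supply.

```r
# Hypothetical helper: build one (name, path, color) spec per script.
# list.files() returns paths in alphabetical order, so numeric
# prefixes such as 01_, 02_ determine the order of the specs.
pipeline_specs <- function(dir, colors) {
  paths <- list.files(dir, pattern = "\\.R$", full.names = TRUE)
  stopifnot(length(colors) >= length(paths))

  # "01_data_prep.R" -> "data prep"
  labels <- tools::file_path_sans_ext(basename(paths))
  labels <- gsub("_", " ", sub("^[0-9]+_", "", labels))

  Map(
    function(label, path, color) list(name = label, path = path, color = color),
    labels, paths, colors[seq_along(paths)]
  )
}

# Demonstration on a throwaway directory with empty placeholder files
demo_dir <- file.path(tempdir(), "pipelines_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("02_analysis.R", "01_data_prep.R")))

specs <- pipeline_specs(demo_dir, colors = c("#E69F00", "#56B4E9"))
unname(vapply(specs, function(s) s$name, character(1)))
```

Each spec could then be turned into a pipeline object with something like unname(lapply(specs, function(s) rxp_pipeline(name = s$name, path = s$path, color = s$color))), and the resulting list handed to rxp_populate().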

Visualising Sub-Pipelines

When sub-pipelines are defined, visualisation tools use pipeline colours:

Interactive Network (rxp_visnetwork()) and Static DAG (rxp_ggdag()) both use a dual-encoding approach:
- Node fill (interior): Derivation type colour (R = blue, Python = yellow, etc.)
- Node border (thick stroke): Pipeline group colour

This allows you to see both what type of computation each node is and which pipeline it belongs to.

Trace: rxp_trace() output in the console is coloured by pipeline (using the cli package). If your terminal supports colour, each derivation name is printed in the colour of its sub-pipeline.

Switching Between Colour Modes

# Dual encoding: fill = type, border = pipeline (default when pipelines are defined)
rxp_ggdag(color_by = "pipeline")

# Colour entirely by derivation type (rxp_r, rxp_py, etc.) - original behaviour
rxp_ggdag(color_by = "type")

How It Works Internally

When you call rxp_populate() with rxp_pipeline objects:

Flattening: Pipelines are flattened to a single list of derivations
Metadata Preservation: Each derivation retains pipeline_group and pipeline_color
DAG Generation: dag.json includes pipeline metadata
Visualisation: rxp_visnetwork() and rxp_ggdag() read this metadata
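The flattening and metadata-preservation steps can be pictured with a toy model in base R. This is not the rixpress implementation: pipelines and derivations are represented as plain lists, and flatten_pipelines() is a hypothetical helper. Only the field names pipeline_group and pipeline_color come from the description above.

```r
# Toy model: a pipeline is a list with a name, a colour and derivations
pipelines <- list(
  list(
    name = "Data Preparation", color = "#E69F00",
    derivs = list(list(name = "raw_mtcars"), list(name = "clean_mtcars"))
  ),
  list(
    name = "Statistical Analysis", color = "#56B4E9",
    derivs = list(list(name = "mpg_model"))
  )
)

# Hypothetical helper: tag each derivation with its pipeline's
# metadata, then concatenate everything into one flat list
flatten_pipelines <- function(pipelines) {
  tagged <- lapply(pipelines, function(p) {
    lapply(p$derivs, function(d) {
      d$pipeline_group <- p$name
      d$pipeline_color <- p$color
      d
    })
  })
  do.call(c, tagged)
}

flat <- flatten_pipelines(pipelines)
length(flat)
flat[[3]]$pipeline_group
```

Because each derivation carries its own group and colour after flattening, later steps can work on the flat list without needing the original nesting.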

Best Practices

Use descriptive pipeline names: “Data Preparation” is better than “ETL”
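Choose contrasting colours: Use a colour-blind-friendly qualitative palette, such as those from ColorBrewer or the Okabe-Ito palette (the source of #E69F00 and #56B4E9 in the example above)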
Keep sub-pipelines focused: One logical phase per sub-pipeline
Order your files: Use numeric prefixes (01_, 02_, etc.)

Conclusion

rxp_pipeline() provides a simple yet powerful way to organise complex pipelines. By grouping derivations into logical units, you can:

Keep your code organised and maintainable
Enable team collaboration on different parts of the pipeline
Visualise the structure of your workflow with meaningful colours
Reuse sub-pipelines across projects

For a working example, see the subpipelines demo in the rixpress_demos repository.