Dataflow CRD

API Reference

Packages:

bytewax.io/v1alpha1

Resource Types:

Dataflow

↩ Parent

Dataflow is the Schema for the dataflows API

| Name | Type | Description | Required |
|------|------|-------------|----------|
| apiVersion | string | bytewax.io/v1alpha1 | true |
| kind | string | Dataflow | true |
| metadata | object | Refer to the Kubernetes API documentation for the fields of the `metadata` field. | true |
| spec | object | DataflowSpec defines the desired state of Dataflow | false |
| status | object | DataflowStatus defines the observed state of Dataflow | false |

Dataflow.spec

↩ Parent

DataflowSpec defines the desired state of Dataflow

| Name | Type | Description | Required |
|------|------|-------------|----------|
| image | object | Dataflow container image settings | true |
| pythonFileName | string | Python script file to run | true |
| artifactsDownload | object | Downloads a tar file from a URL. The URL can be public or point to a private GitHub repository | false |
| chartValues | string | Advanced Bytewax Helm chart values. See more at https://bytewax.github.io/helm-charts | false |
| concurrencyPolicy | string | Specifies how to treat concurrent executions of a Scheduled Dataflow. Valid values are: "Allow" (allows Dataflows to run concurrently), "Forbid" (default; forbids concurrent runs, skipping the next run if the previous run hasn't finished yet), and "Replace" (cancels the currently running Dataflow and replaces it with a new one) | false |
| configMapName | string | Dataflow ConfigMap name | false |
| dependencies | []string | Python dependencies needed to run the dataflow (use pip syntax, e.g. package-name==0.1.0) | false |
| env | []object | Environment variables to inject into the dataflow container | false |
| jobMode | boolean | Run a Job in Kubernetes instead of a StatefulSet. Use this for batch processing. Default: false | false |
| keepAlive | boolean | Keep the Dataflow container alive after the dataflow ends; useful for troubleshooting. Default: false | false |
| processesCount | integer | Number of processes to run. Default: 1 | false |
| recovery | object | Stores recovery files in Kubernetes persistent volumes to allow resuming after a restart (your dataflow must have recovery enabled: https://bytewax.io/docs/getting-started/recovery) | false |
| schedule | string | Dataflow schedule in Cron format, see https://en.wikipedia.org/wiki/Cron | false |
| suspend | boolean | Suspends Dataflow execution. For Scheduled Dataflows, it suspends subsequent executions; it does not apply to executions that have already started. Default: false | false |
| tarName | string | Tar file name stored in the dataflow ConfigMap | false |
| workersPerProcess | integer | Number of workers to run in each process. Default: 1 | false |
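Putting the required fields together with a few common options, a minimal Dataflow manifest might look like the following sketch (the metadata name, image repository, script name, dependency, and environment variable are illustrative, not values the CRD prescribes):

```yaml
apiVersion: bytewax.io/v1alpha1
kind: Dataflow
metadata:
  name: my-dataflow              # illustrative name
spec:
  image:                         # required: container image settings
    repository: bytewax/bytewax  # illustrative repository
    tag: latest
  pythonFileName: dataflow.py    # required: illustrative script name
  dependencies:
    - requests==2.31.0           # illustrative pip-syntax dependency
  processesCount: 2              # two processes...
  workersPerProcess: 2           # ...with two workers each
  env:
    - name: LOG_LEVEL            # illustrative environment variable
      value: debug
```

Adding a `schedule` in Cron format (e.g. `"0 * * * *"`) turns this into a Scheduled Dataflow, at which point `concurrencyPolicy` and `suspend` become relevant.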

Dataflow.spec.image

↩ Parent

Dataflow container image settings

| Name | Type | Description | Required |
|------|------|-------------|----------|
| tag | string | Container image tag | true |
| pullPolicy | string | Container image pull policy (the value must be Always, IfNotPresent, or Never). Default: Always | false |
| pullSecret | string | Kubernetes secret name used to pull images. Default: default-credentials | false |
| repository | string | Container image repository URI | false |
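For example, assuming a private registry, the image section could be configured as follows (the repository URI and secret name are illustrative):

```yaml
spec:
  image:
    repository: registry.example.com/team/my-dataflow  # illustrative private repository
    tag: v1.2.3
    pullPolicy: IfNotPresent          # skip pulling when the image is cached
    pullSecret: registry-credentials  # illustrative pull-secret name
```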

Dataflow.spec.artifactsDownload

↩ Parent

Downloads a tar file from a URL. It could be a public URL or a private GitHub repository

| Name | Type | Description | Required |
|------|------|-------------|----------|
| url | string | URL of the tar file to download | true |
| secretName | string | Name of the Kubernetes secret storing a Personal Access Token. It must contain the key TOKEN | false |
| token | string | Personal Access Token | false |
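As a sketch, downloading a tar file from a private GitHub repository could look like this (the URL and secret name are illustrative; `secretName` avoids placing the token directly in the manifest):

```yaml
spec:
  artifactsDownload:
    url: https://github.com/example/repo/archive/refs/tags/v1.0.0.tar.gz  # illustrative URL
    secretName: github-token  # illustrative secret; must contain the key TOKEN
```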

Dataflow.spec.env[index]

↩ Parent

| Name | Type | Description | Required |
|------|------|-------------|----------|
| name | string | Environment variable name | true |
| value | string | Environment variable value | true |

Dataflow.spec.recovery

↩ Parent

Stores recovery files in Kubernetes persistent volumes to allow resuming after a restart (your dataflow must have recovery enabled: https://bytewax.io/docs/getting-started/recovery)

| Name | Type | Description | Required |
|------|------|-------------|----------|
| backupInterval | integer | System time duration in seconds to keep extra state snapshots around. Default: 1 | false |
| cloudBackup | object | Back up worker state DBs to cloud storage | false |
| enabled | boolean | Enable the Dataflow recovery feature. Default: false | false |
| partsCount | integer | Number of recovery partitions. Default: 1 | false |
| persistence | object | Kubernetes persistence settings | false |
| singleVolume | boolean | Use only one persistent volume for all of the dataflow's pods in Kubernetes. Default: false | false |
| snapshotInterval | integer | System time duration in seconds between state snapshots for recovery. Default: 1 | false |
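A recovery section combining these fields with the persistence settings described below might look like this sketch (the interval values and storage class are illustrative choices, not recommendations):

```yaml
spec:
  recovery:
    enabled: true            # dataflow itself must also have recovery enabled
    partsCount: 2            # two recovery partitions
    snapshotInterval: 30     # illustrative: snapshot state every 30 seconds
    backupInterval: 60       # illustrative: keep extra snapshots for 60 seconds
    persistence:
      size: 20Gi             # PVC size per dataflow pod
      storageClassName: gp2  # illustrative storage class
```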

Dataflow.spec.recovery.cloudBackup

↩ Parent

Back up worker state DBs to cloud storage

| Name | Type | Description | Required |
|------|------|-------------|----------|
| enabled | boolean | Enables the Cloud Backup feature. Default: false | false |
| s3 | object | Cloud Backup S3 settings | false |

Dataflow.spec.recovery.cloudBackup.s3

↩ Parent

Cloud Backup S3 settings

| Name | Type | Description | Required |
|------|------|-------------|----------|
| url | string | S3 URL to store state backups. For example, s3://mybucket/mydataflow-state-backups | true |
| accessKeyId | string | AWS credentials access key ID | false |
| secretAccessKey | string | AWS credentials secret access key | false |
| secretName | string | Name of the Kubernetes Secret that stores AWS credentials. It must contain the keys LITESTREAM_ACCESS_KEY_ID and LITESTREAM_SECRET_ACCESS_KEY | false |
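Referencing a Kubernetes Secret rather than inlining `accessKeyId`/`secretAccessKey` keeps credentials out of the manifest. A sketch, with an illustrative secret name:

```yaml
spec:
  recovery:
    enabled: true
    cloudBackup:
      enabled: true
      s3:
        url: s3://mybucket/mydataflow-state-backups
        secretName: aws-credentials  # illustrative; must contain LITESTREAM_ACCESS_KEY_ID
                                     # and LITESTREAM_SECRET_ACCESS_KEY
```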

Dataflow.spec.recovery.persistence

↩ Parent

Kubernetes Persistence settings

| Name | Type | Description | Required |
|------|------|-------------|----------|
| size | string | Size of the persistent volume claim to be assigned to each dataflow pod in Kubernetes. Default: 10Gi | false |
| storageClassName | string | Storage class of the persistent volume claim to be assigned to each dataflow pod in Kubernetes | false |

Dataflow.status

↩ Parent

DataflowStatus defines the observed state of Dataflow

| Name | Type | Description | Required |
|------|------|-------------|----------|
| conditions | []object | Conditions describing the observed state of the Dataflow | false |

Dataflow.status.conditions[index]

↩ Parent

Condition contains details for one aspect of the current state of this API Resource. This is the standard Kubernetes `metav1.Condition` type, used as a list at the field path `.status.conditions` and merged by `type`.

| Name | Type | Description | Required |
|------|------|-------------|----------|
| lastTransitionTime | string | lastTransitionTime is the last time the condition transitioned from one status to another. This should be when the underlying condition changed. If that is not known, then using the time when the API field changed is acceptable. Format: date-time | true |
| message | string | message is a human-readable message indicating details about the transition. This may be an empty string. | true |
| reason | string | reason contains a programmatic identifier indicating the reason for the condition's last transition. Producers of specific condition types may define expected values and meanings for this field, and whether the values are considered a guaranteed API. The value should be a CamelCase string. This field may not be empty. | true |
| status | enum | status of the condition. Enum: True, False, Unknown | true |
| type | string | type of condition in CamelCase or in foo.example.com/CamelCase. Many .condition.type values are consistent across resources (like Available), but because arbitrary conditions can be useful (see .node.status.conditions), the ability to deconflict is important. The regex it matches is (dns1123SubdomainFmt/)?(qualifiedNameFmt) | true |
| observedGeneration | integer | observedGeneration represents the .metadata.generation that the condition was set based upon. For instance, if .metadata.generation is currently 12, but the .status.conditions[x].observedGeneration is 9, the condition is out of date with respect to the current state of the instance. Format: int64, Minimum: 0 | false |
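As an illustration, a populated entry under `.status.conditions` could look like the following. The `type` and `reason` values here are hypothetical — this reference does not document which condition types the operator actually sets — but the field shapes follow the table above:

```yaml
status:
  conditions:
    - type: Deployed                              # hypothetical condition type
      status: "True"                              # one of True, False, Unknown
      reason: DataflowRunning                     # hypothetical CamelCase reason
      message: All processes are running
      lastTransitionTime: "2024-01-01T00:00:00Z"  # RFC 3339 date-time
      observedGeneration: 3                       # generation this condition reflects
```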