This is a sample data pipeline implemented using the Transposit platform.
This data is all fake. It is pulled from this sample data.
Daily sales data is dropped into an S3 bucket by some other process.
We want to process that data and enrich it with additional information. We only want to process new orders. For these orders, we want to add who the sales lead was and the inventory level. For this example, these enrichments are mocked up, but you could easily add a new data connector that reaches out to internal APIs.
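To make the enrichment concrete, here is a minimal standalone Python sketch of that step. It is not the actual Transposit operation: the column names follow the sample sales CSV, and the sales-lead and inventory lookups are mocked-up assumptions, just as the enrichments are mocked up in the app.

```python
# Standalone sketch of the "filter new orders, then enrich" step.
# Column names follow the sample sales CSV; the lookups are illustrative mocks.

SALES_LEADS = {"Europe": "Alice", "Asia": "Bob"}      # hypothetical region -> sales lead
MOCK_INVENTORY = {"Cereal": 1200, "Clothes": 300}     # hypothetical item type -> stock level

def is_new_order(order, processed_ids):
    """Keep only orders we have not seen before."""
    return order["Order ID"] not in processed_ids

def enrich(order):
    """Attach the sales lead for the region and a mocked inventory level."""
    return {
        **order,
        "Sales Lead": SALES_LEADS.get(order["Region"], "unassigned"),
        "Inventory Level": MOCK_INVENTORY.get(order["Item Type"], 0),
    }

if __name__ == "__main__":
    orders = [
        {"Order ID": "1", "Region": "Europe", "Item Type": "Cereal", "Total Revenue": 100.0},
        {"Order ID": "2", "Region": "Asia", "Item Type": "Clothes", "Total Revenue": 250.0},
    ]
    processed_ids = {"1"}  # pretend order 1 was handled in an earlier run
    new_orders = [enrich(o) for o in orders if is_new_order(o, processed_ids)]
    print(new_orders)
```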
We want to push each region's sales total to a Google spreadsheet (perhaps for an executive dashboard) and add all new orders to a data warehouse.
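The regional summation is just a grouped sum over revenue. A minimal Python sketch, assuming each order carries `Region` and `Total Revenue` fields as in the sample CSV:

```python
# Sketch of the per-region summation that gets pushed to the spreadsheet.
# Assumes each order carries "Region" and "Total Revenue" fields.
from collections import defaultdict

def sum_sales_by_region(orders):
    totals = defaultdict(float)
    for order in orders:
        totals[order["Region"]] += float(order["Total Revenue"])
    return dict(totals)

if __name__ == "__main__":
    orders = [
        {"Region": "Europe", "Total Revenue": "100.0"},
        {"Region": "Europe", "Total Revenue": "50.0"},
        {"Region": "Asia", "Total Revenue": "250.0"},
    ]
    print(sum_sales_by_region(orders))  # {'Europe': 150.0, 'Asia': 250.0}
```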
The pipeline looks like this:
S3 file -> filter out old orders -> add sales leads -> add inventory data -> sum up sales -> update Google sheets -> update BigQuery
You need to set up the following external resources:

* An S3 bucket with an `upload` folder and a `processed` folder. Download the `100 Sales Records.csv` file from the sample data and upload it to the `upload` folder (a scripted version of this setup is sketched after this list).
* An IAM user with `AmazonS3FullAccess` permissions, or at a minimum read/write permissions for the bucket you created above. You'll need the `AWS_ACCESS_KEY` and `AWS_SECRET_ACCESS_KEY` for this user.
* A BigQuery dataset called `default` with a table in it called `orderdata`. Create the table from a file upload (of the sample sales data) and select 'Auto detect' for the schema, so that the schema gets picked up from the CSV file. Then delete all of the rows (``delete from `default.orderdata` where 1=1``), or just delete some of them (``delete from `default.orderdata` where Region='Europe'``).
* A free Transposit account.
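If you would rather script the S3 part of that setup, here is a minimal boto3 sketch. The bucket name and local file path are placeholders to substitute with your own, and the empty `processed/` object is just one common way to make the folder show up in the S3 console.

```python
# Sketch of the S3 setup using boto3 (bucket name and file path are placeholders).
import boto3

BUCKET = "my-sales-pipeline-bucket"   # substitute your own bucket name

s3 = boto3.client("s3")

# Create the bucket (add a CreateBucketConfiguration for regions other than us-east-1).
s3.create_bucket(Bucket=BUCKET)

# "Folders" in S3 are just key prefixes; an empty object makes one visible in the console.
s3.put_object(Bucket=BUCKET, Key="processed/")

# Upload the sample data into the upload folder.
s3.upload_file("100 Sales Records.csv", BUCKET, "upload/100 Sales Records.csv")
```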
Then, in the Transposit application:

* The `scheduled_task` operation contains the S3 manipulations (a sketch of that file shuffling follows below), the `pipeline_top` operation documents the pipeline, and the `add_sales_rep` operation adds in some sales data.
* Set the `Custom timeout` to `300000` (5 minutes) for the `scheduled_task` operation by going to the Properties tab.
* Schedule the `scheduled_task` operation to run every 10 minutes: `37 /10 * ? * *`
You can also run the pipeline by clicking "Run now & show log". You should then see records start to appear in the BigQuery table.
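For a rough idea of the S3 manipulations the `scheduled_task` operation performs, here is a standalone boto3 sketch that moves a handled file from the `upload` folder to the `processed` folder. This is one plausible reading of the two folders, not the actual Transposit operation; the bucket name and key are placeholders.

```python
# Illustrative sketch: move a handled file from upload/ to processed/ in S3.
# Not the actual Transposit operation; bucket name and key are placeholders.
import boto3

BUCKET = "my-sales-pipeline-bucket"
KEY = "upload/100 Sales Records.csv"

s3 = boto3.client("s3")

# S3 has no native "move": copy to the new prefix, then delete the original.
dest_key = KEY.replace("upload/", "processed/", 1)
s3.copy_object(Bucket=BUCKET, CopySource={"Bucket": BUCKET, "Key": KEY}, Key=dest_key)
s3.delete_object(Bucket=BUCKET, Key=KEY)
```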