Snowflake Business Case: Using Data Factory to ingest from API
A short project covering a proper business case - how to ingest data from an API using Azure Data Factory and load it into Snowflake
Hello guys! I’m happy to see you around here again.
After asking my audience, I have realized that what most people are looking for is end-to-end projects and actual business cases that you may face in your job as a Data Engineer. So, here we are! In this week’s post I’ll show you how to load data from an API into a Snowflake account using Azure Data Factory. In future posts, I’ll definitely cover many other data sources.
Note on this project
Similarly to our dbt + Snowflake project, I will assume that you know how to create Azure resources and that you are familiar with technical concepts like Sources, Sinks or Linked Services.
I will try my best to make this post user-friendly and self-sufficient so you don't need to search for information outside this page, but again, I am assuming that you have some knowledge of the topics.
Note on the image usage in this post
This post will not have many images due to email size limitations, so ideally you'll follow along with your own Azure account and your own Snowflake account. I hope this is not an inconvenience for anyone.
However, I will try to add them wherever they are truly necessary, but I will skip them for obvious steps (creating resource groups, creating ADLS, etc.) since, as mentioned in the previous note, I'll assume that you already know how to create resources in Azure.
Project Architecture
The business need
Since there is no actual business need, let’s invent an interesting and entertaining one: Professor Oak wants to build a data warehouse for the pokemon data!
It seems that the Pokedex is kind of old school now, and with so many new pokemons, the database is collapsing because of the lack of data modeling :(. Our job as Data Engineers is to fix this problem by creating a data warehouse (a tiny one).
The proposed solution
The proposed solution (and definitely, not the only one) is to extract the data from the API by using Azure Data Factory. Then, we will store this data in json files in a container. Finally, we’ll ingest this data into a raw layer in Snowflake to proceed with the proper transformations. So, as a summary:
Source: Pokemon API (PokeAPI)
Ingestion: Azure Data Factory (ADF)
Files storage: ADLSg2 (ADLS)
SF ingestion, storage and transformations: Snowflake (Snowflake)
API Endpoints
As homework, it could be interesting for you to explore the PokeAPI on your own. And of course, I invite you to use this project as a template rather than doing exactly what I’m doing: pick your favorite pokemon data (or maybe pick your own API).
For this specific case, I will use the following endpoints:
https://pokeapi.co/api/v2/pokemon?limit={N} → First N pokemons (I will use N = 151, which is the 1st generation)
https://pokeapi.co/api/v2/pokemon/{id or name}/ → Information for a specific pokemon
So, what we are going to do is to build a small ingestion system to get the pokemon data (for the first 151 pokemon) into individual json files. Then, in Snowflake we will create a stage to access our data in ADLS. We will load this raw data into landing tables. Finally, we will separate pokemon data into two tables: pokemon (id, name, height, weight), pokemon_moves (pokemon_id, move).
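To make the plan concrete, here is a small Python sketch over a trimmed example of what the listing endpoint returns (shape per the PokeAPI; only two results shown, the real response carries more fields and all 151 entries):

```python
import json

# A trimmed example of the listing endpoint's response
# (only two results shown; the real payload has more fields).
listing_response = json.loads("""
{
  "results": [
    {"name": "bulbasaur", "url": "https://pokeapi.co/api/v2/pokemon/1/"},
    {"name": "ivysaur",   "url": "https://pokeapi.co/api/v2/pokemon/2/"}
  ]
}
""")

# The ingestion iterates over `results` and will write one
# <name>.json file per pokemon into the container.
files = [f"{item['name']}.json" for item in listing_response["results"]]
print(files)  # ['bulbasaur.json', 'ivysaur.json']
```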
Snowflake Tables Set Up
Before going ahead and ingesting data or creating resources, we have to define in Snowflake where our data is going to live. That is, a landing/raw/staging table for the data.
As mentioned in the previous section, we’ll be ingesting pokemon data. Let’s go ahead and create our objects in Snowflake:
CREATE OR REPLACE DATABASE POKEMON_DB;
CREATE OR REPLACE SCHEMA POKEMON_DB.RAW;
CREATE OR REPLACE TABLE POKEMON_DB.RAW.RAW_POKEMON (
pokemon_id int,
pokemon_name string,
height int,
weight int,
moves variant
);

If you are wondering why I picked variant as the data type, maybe you skipped your homework of exploring the API and performing some calls to it.
I picked integers for id, height and weight because the API returns them as ints. The same goes for pokemon_name, which comes as a string. But moves is a complex data type (check a sample API call), so I will store it as a variant.
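To see why moves needs a variant while the other columns map to simple types, here is a sketch over a heavily trimmed per-pokemon response (the real response has many more fields; this keeps only what we load):

```python
import json

# A heavily trimmed example of GET /pokemon/bulbasaur.
response = json.loads("""
{
  "id": 1,
  "name": "bulbasaur",
  "height": 7,
  "weight": 69,
  "moves": [
    {"move": {"name": "razor-wind", "url": "https://pokeapi.co/api/v2/move/13/"}},
    {"move": {"name": "swords-dance", "url": "https://pokeapi.co/api/v2/move/14/"}}
  ]
}
""")

# id/name/height/weight are scalars -> int/string columns in Snowflake.
row = (response["id"], response["name"], response["height"], response["weight"])
print(row)  # (1, 'bulbasaur', 7, 69)

# moves is an array of nested objects -> store it as a variant.
move_names = [m["move"]["name"] for m in response["moves"]]
print(move_names)  # ['razor-wind', 'swords-dance']
```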
Azure Set Up
Let’s go ahead and create the resources that we need for this project: ADF and ADLS.
Azure Resources
Azure Data Factory
As mentioned, we will use Azure Data Factory to ingest data from the API into our ADLS. So it’s a nice moment to go to your Azure account and create a Data Factory (remember that you need an active subscription and also a resource group for this).
Azure Data Lake Storage Gen 2 (ADLS)
Also, we will need an intermediate storage to keep the API responses in .json files that we’ll ingest into Snowflake. For this, we’ll create an ADLS. You can do this via the Storage Account service in Azure.
Working in Azure Data Factory
If you are familiar with Data Factory, you probably know what steps to follow. But maybe this is not the case, so let’s go step-by-step.
Linked Services and Datasets
Before connecting to our ADLS or Snowflake, we have to create some Linked Services - basically, the equivalent of a connection string. Linked Services are services (of course) that contain the information on how to connect to external resources (ADLS, Snowflake, AWS…).
After creating a linked service, we’ll have to create individual datasets for our data.
So, in your Data Factory, go ahead to Manage → Linked Services → Create Linked Service. We have to create one linked service for Snowflake, another one for our ADLS and a last one for our API.
Snowflake Linked Service
When prompted with “Data Store”, pick Snowflake V2. For your Snowflake Linked Service (I named mine LS_Snowflake) you will need the following information:
Account Name (XXXXXXX-YYYYYYY)
Warehouse
Database
Authentication Type
User/Password
Role
{
    "name": "LS_Snowflake",
    "type": "Microsoft.DataFactory/factories/linkedservices",
    "properties": {
        "annotations": [],
        "type": "SnowflakeV2",
        "typeProperties": {
            "authenticationType": "Basic",
            "accountIdentifier": "<account-identifier>",
            "user": "<user>",
            "database": "<database>",
            "warehouse": "<warehouse>",
            "role": "<role>",
            "encryptedCredential": ""
        },
        "version": "1.1"
    }
}
Notes on Linked Services
To follow best practices, you would need to parameterize this linked service so you don't need to create one linked service per database or warehouse. But to keep it simple, I will just target this one to my raw layer.
Also, user/password should be safely stored in Azure Key Vault - again, I'll just hardcode it to go faster.
Finally, ideally you are not using an ultra-powerful user with the ACCOUNTADMIN role; instead, you should create a role with only the required permissions.

Once you have introduced all that information in the input boxes, go ahead and test the connection. If it is successful, save it.
ADLS Linked Service
Similarly to what we did for the Snowflake Linked Service, go ahead and create a linked service for Azure Data Lake Storage (I called mine LS_ADLS). Then pick a subscription and a storage account.
{
    "name": "LS_ADLS",
    "type": "Microsoft.DataFactory/factories/linkedservices",
    "properties": {
        "annotations": [],
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<your-storage-account>.dfs.core.windows.net/",
            "encryptedCredential": ""
        }
    }
}
Rest API Linked Service
Similarly, to copy the results from the API call into a json file, we’ll need to create a linked service for the PokeAPI. In this Linked Service you have to introduce the base url (https://pokeapi.co/api/v2/) and, since it is a public API, you can set authentication to Anonymous.
{
    "name": "LS_PokeAPI",
    "properties": {
        "annotations": [],
        "type": "RestService",
        "typeProperties": {
            "url": "https://pokeapi.co/api/v2/",
            "enableServerCertificateValidation": true,
            "authenticationType": "Anonymous"
        }
    }
}
Rest API Dataset
Now we are ready to create a dataset. While a linked service will connect our ADF to the external resource, a dataset will be the actual data object containing the information.
In the Author tab, click Datasets (under Pipelines and CDC). Then New Dataset → Rest → pick your linked service. At this point your dataset for the REST API is almost ready; we still have to create a parameter. Remember that each pokemon requires an individual API call, so we need to parameterize this dataset: the parameter will be the pokemon name. Then, modify the relative URL field to target the proper endpoint by dynamically concatenating the endpoint and the pokemon_name parameter (see the API Endpoints section at the top of this post!).
{
    "name": "rest_pokemon_dataset",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LS_PokeAPI",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "pokemon_name": {
                "type": "string"
            }
        },
        "annotations": [],
        "type": "RestResource",
        "typeProperties": {
            "relativeUrl": {
                "value": "@concat('pokemon/',dataset().pokemon_name)",
                "type": "Expression"
            }
        },
        "schema": []
    }
}
ADLS Dataset
Following the same approach, let’s create an ADLS dataset in our Datasets section. Configure it by selecting your ADLS linked service, providing the container name and directory and, again, parameterizing the file name. Remember that the file name has to end in .json, so the file name will be the parameter concatenated with the .json extension.
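If the expression syntax feels opaque, this little Python sketch mimics how the two @concat expressions resolve at runtime (the helper names are mine and purely illustrative - ADF evaluates its own expression language; this just shows the resulting strings):

```python
# Mimic how ADF resolves the two dataset expressions at runtime.
# These helpers are illustrative only, not ADF code.

def rest_relative_url(pokemon_name: str) -> str:
    # "@concat('pokemon/', dataset().pokemon_name)"
    return "pokemon/" + pokemon_name

def adls_file_name(file_name: str) -> str:
    # "@concat(dataset().file_name, '.json')"
    return file_name + ".json"

print(rest_relative_url("pikachu"))  # pokemon/pikachu
print(adls_file_name("pikachu"))     # pikachu.json
```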
{
    "name": "poke_dataset",
    "properties": {
        "linkedServiceName": {
            "referenceName": "LS_ADLS",
            "type": "LinkedServiceReference"
        },
        "parameters": {
            "file_name": {
                "type": "string"
            }
        },
        "annotations": [],
        "type": "JSON",
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileName": {
                    "value": "@concat(dataset().file_name,'.json')",
                    "type": "Expression"
                },
                "folderPath": "raw",
                "fileSystem": "pokemondata"
            },
            "compressionCodec": "snappy"
        },
        "schema": []
    }
}
Data Factory Pipeline
For the pipeline, we will follow the structure below:
Web Activity → With a web activity we will target the Pokemon API and get a list of the 151 pokemons. The response will be a JSON containing pokemon name and API Url for that pokemon.
The Web activity has to be configured with the url https://pokeapi.co/api/v2/pokemon?limit=151, method GET and Authentication None because this is a public API.
ForEach Activity → A For Each activity will iterate through every item in the API result and call the specific pokemon endpoint to get their information.
Copy → The Copy activity will take the output from our request (performed through the dataset derived from the REST API linked service) and dump it into a json file. The configuration for the Copy activity will be Source: REST API Dataset, Sink: ADLS Dataset. The parameter that you need to introduce for both the source and the sink will be the item name (a.k.a. the pokemon name).
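As a mental model, here is the same Web → ForEach → Copy logic written as a plain Python script. This is only a local sketch of what the pipeline does, not something ADF needs; the function names are mine, and `fetch` is injectable so the loop can be exercised without the network:

```python
import json
from pathlib import Path
from urllib.request import urlopen

BASE = "https://pokeapi.co/api/v2/"

def fetch_json(url: str) -> dict:
    """GET a URL and parse the JSON body (what the Web/Copy activities do)."""
    with urlopen(url) as resp:
        return json.load(resp)

def ingest(out_dir: str, limit: int = 151, fetch=fetch_json) -> list:
    """List the first `limit` pokemons, then dump each one to <name>.json."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    listing = fetch(f"{BASE}pokemon?limit={limit}")     # Web activity
    written = []
    for item in listing["results"]:                     # ForEach activity
        data = fetch(f"{BASE}pokemon/{item['name']}")   # Copy activity: source
        path = out / f"{item['name']}.json"             # Copy activity: sink
        path.write_text(json.dumps(data))
        written.append(path.name)
    return written
```

Running `ingest("raw", limit=151)` against the real API would leave 151 json files locally, mirroring what the Copy activity writes into the ADLS container.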
Data Factory Pipeline JSON:
Here you have the final version of the pipeline in JSON format. If you struggle defining these objects on your own, feel free to copy-paste the definitions. They should work :)
{
    "name": "Pokemon_data_to_JSON",
    "properties": {
        "activities": [
            {
                "name": "Get_Pokemon_Data",
                "type": "WebActivity",
                "dependsOn": [],
                "policy": {
                    "timeout": "0.12:00:00",
                    "retry": 0,
                    "retryIntervalInSeconds": 30,
                    "secureOutput": false,
                    "secureInput": false
                },
                "userProperties": [],
                "typeProperties": {
                    "method": "GET",
                    "url": "https://pokeapi.co/api/v2/pokemon?limit=151"
                }
            },
            {
                "name": "ForEach1",
                "type": "ForEach",
                "dependsOn": [
                    {
                        "activity": "Get_Pokemon_Data",
                        "dependencyConditions": [
                            "Succeeded"
                        ]
                    }
                ],
                "userProperties": [],
                "typeProperties": {
                    "items": {
                        "value": "@activity('Get_Pokemon_Data').output.results",
                        "type": "Expression"
                    },
                    "activities": [
                        {
                            "name": "Copy_pokemon_data_to_json",
                            "type": "Copy",
                            "dependsOn": [],
                            "policy": {
                                "timeout": "0.12:00:00",
                                "retry": 0,
                                "retryIntervalInSeconds": 30,
                                "secureOutput": false,
                                "secureInput": false
                            },
                            "userProperties": [],
                            "typeProperties": {
                                "source": {
                                    "type": "RestSource",
                                    "httpRequestTimeout": "00:01:40",
                                    "requestInterval": "00.00:00:00.010",
                                    "requestMethod": "GET",
                                    "paginationRules": {
                                        "supportRFC5988": "true"
                                    }
                                },
                                "sink": {
                                    "type": "JsonSink",
                                    "storeSettings": {
                                        "type": "AzureBlobFSWriteSettings"
                                    },
                                    "formatSettings": {
                                        "type": "JsonWriteSettings"
                                    }
                                },
                                "enableStaging": false
                            },
                            "inputs": [
                                {
                                    "referenceName": "rest_pokemon_dataset",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "pokemon_name": {
                                            "value": "@item().name",
                                            "type": "Expression"
                                        }
                                    }
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "poke_dataset",
                                    "type": "DatasetReference",
                                    "parameters": {
                                        "file_name": {
                                            "value": "@item().name",
                                            "type": "Expression"
                                        }
                                    }
                                }
                            ]
                        }
                    ]
                }
            }
        ],
        "annotations": []
    }
}
Snowflake Set Up
Generating a SAS for Snowflake Access
Despite this being a Snowflake project, we have spent quite some time in Azure, haven’t we? We are almost ready to jump into Snowflake. One last thing!
Before ingesting the data, we need to generate some credentials to let Snowflake access our data. For this, you’ll need to go to your storage account and follow the next steps:
Storage Account → Security + networking → Shared access signature
Configure the SAS to give Read and List permissions to Containers and Objects. Then adjust the start and expiration time for this SAS and generate it. After you generate it, save it. You will need it soon.
Creating an external stage
Now, making use of our recently created SAS and our storage account + container URL, let’s create an external stage to let Snowflake know where the data files live:
CREATE OR REPLACE STAGE POKEMON_DB.RAW.POKEMON_STAGE
URL= 'azure://<storage_account>.blob.core.windows.net/pokemondata/raw/'
CREDENTIALS = (AZURE_SAS_TOKEN = '<your_sas_token>')
file_format = (type = 'json');

To verify that this worked correctly, you might want to execute
LIST @POKEMON_STAGE;

And this should return a list of the .json files that are living in your raw container.
Copying raw data into tables
Now it’s time to perform the data ingestion from Azure into Snowflake. If you analyze the data a little bit:
SELECT $1 FROM @pokemon_stage/bulbasaur.json;

You will see that this API request returns a lot of data that, for the sake of simplicity, we will ignore. As shown in the script at the top, we will keep just the id, name, height, weight and moves.
COPY INTO POKEMON_DB.RAW.RAW_POKEMON FROM (
SELECT
$1:id,
$1:name,
$1:height,
$1:weight,
$1:moves
from @pokemon_stage);

Now if you query the RAW_POKEMON table, you will see that the data for all pokemons has been loaded successfully.
After loading this data, we might want to clean it a little bit, because we have structured data mixed with some unstructured data… let’s go ahead and create a new schema and two tables:
CREATE OR REPLACE SCHEMA POKEMON_DB.PROCESSED;
CREATE OR REPLACE TABLE POKEMON_DB.PROCESSED.POKEMON_DATA (
POKEMON_ID INT,
POKEMON_NAME STRING,
HEIGHT INT,
WEIGHT INT
);
CREATE OR REPLACE TABLE POKEMON_DB.PROCESSED.POKEMON_MOVES (
POKEMON_ID INT,
MOVE_NAME STRING
);

As you can imagine, we will explode our complex moves column into an analytical table containing pokemon_id and move; this way we can see which pokemons can learn which moves. Also, we’ll keep the basic pokemon data (name, height, weight) in a different table.
For the POKEMON_DATA table, the query is quite straightforward:
INSERT INTO PROCESSED.POKEMON_DATA
SELECT POKEMON_ID, POKEMON_NAME, HEIGHT, WEIGHT FROM RAW.RAW_POKEMON;

And for the second one, since we have to explode the moves column into multiple rows, we’ll need to execute the following query:
INSERT INTO PROCESSED.POKEMON_MOVES
SELECT
POKEMON_ID,
v.value:move:name::string AS MOVE
FROM RAW.RAW_POKEMON,
LATERAL FLATTEN(INPUT => moves) v;

The ::string cast turns the variant value into a plain string, which also removes the double quotes that would otherwise wrap the move name.
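If LATERAL FLATTEN is new to you, here is the same explode expressed in plain Python over a trimmed moves structure (purely illustrative, not part of the pipeline):

```python
# A trimmed `moves` value as stored in the variant column,
# plus the pokemon_id from the same raw row.
row = {
    "pokemon_id": 1,
    "moves": [
        {"move": {"name": "razor-wind"}},
        {"move": {"name": "swords-dance"}},
    ],
}

# LATERAL FLATTEN(INPUT => moves) yields one output row per array element;
# v.value:move:name then pulls the move name out of each element.
exploded = [(row["pokemon_id"], m["move"]["name"]) for m in row["moves"]]
print(exploded)  # [(1, 'razor-wind'), (1, 'swords-dance')]
```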
So, at this point you should have a small data model where POKEMON_DATA can be connected to POKEMON_MOVES via pokemon_id and get a ready-to-analyze table!
Quite some work done here right?
Summary of the project
At this point of the project we have done lots of things:
1 - Explored an API to analyze the response format
2 - Created necessary objects in Snowflake (raw tables) to ingest data from the API
3 - Created a pretty cool pipeline in Azure Data Factory - using linked services, datasets and web activities.
4 - Loaded the data into JSON files in our storage account
5 - Created an external stage in Snowflake to load our raw data
6 - Loaded the raw data and created two processed tables
I hope you enjoyed this post and learned a lot on how to move data around different platforms.
As always, remember to subscribe if you found this useful! See you soon :)

