Introduction
It's important to describe your infrastructure as code. Terraform can help us with that.
Authentication
Don't forget to create a variables.tf file in your project root directory, where you should set 3 variables: region (where all your infrastructure will be deployed), and access_key and secret_key for your user, which can be generated via AWS IAM (examples are below).
variable "region" {
  default = "us-east-2"
}
variable "access_key" {
  default = "JFSKLGD8...UFDJKGJS"
}
variable "secret_key" {
  default = "sdfs8d9fgEG33VE...343rVFDV3vdfevr"
}
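If you would rather not keep real keys in a file that may end up in version control, here is a minimal sketch of an alternative variables.tf. It is an assumption on my side rather than what the project uses: the two key variables are declared without defaults, so Terraform either prompts for them or reads them from a terraform.tfvars file or from the TF_VAR_access_key and TF_VAR_secret_key environment variables.

```hcl
# variables.tf (alternative sketch): no secrets stored in the file itself

variable "region" {
  default = "us-east-2"
}

# No default: Terraform prompts for the value, or takes it from
# terraform.tfvars or the TF_VAR_access_key environment variable.
variable "access_key" {}

# Same idea for the secret key (TF_VAR_secret_key).
variable "secret_key" {}
```

The rest of the configuration does not change, because the variables are still referenced as var.access_key and var.secret_key.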
Step by Step Scripts
After passing the AWS Solutions Architect Associate exam, and so as not to forget what I had learned, I looked up the projects that AWS suggests in its Getting Started section and decided to implement them one by one. I chose Analyze Big Data with Hadoop as the first one. For fun, I decided to describe this project with Terraform scripts.
I'd like to share this experience because I faced a couple of non-trivial issues.
- First of all, we need to set up the Terraform provider (see provider.tf).
provider "aws" {
  access_key = "${var.access_key}"
  secret_key = "${var.secret_key}"
  region     = "${var.region}"
}
- Next, we should create an S3 bucket and an EC2 key pair. Both steps are simple and straightforward, and they are described in s3_bucket.tf and key_pair.tf respectively.
resource "aws_s3_bucket" "s3_bucket" {
  bucket = "tf-big-data"
}
resource "aws_key_pair" "emr_key_pair" {
  key_name   = "tf-big-data"
  public_key = "ssh-rsa A...w== rsa-key-20180822"
}
- Creating an EMR cluster via the console takes 5-7 clicks: you choose a couple of options and leave the rest at their defaults. It looks as easy as pie, but in fact a lot of actions happen behind the scenes. So we have to take care of the roles and policies for EMR and for its EC2 instances. For each of them, we have to create 2 data objects (aws_iam_policy and aws_iam_policy_document) and 2 resources (aws_iam_role_policy_attachment and aws_iam_role). These roles are in the roles.tf module.
- Another important section is about network and security (vpc.tf). Here, we're creating 6 resources: aws_vpc; aws_subnet and aws_internet_gateway in this VPC; aws_route_table in this VPC, which has a route via the created internet gateway; aws_main_route_table_association, which connects our VPC and the route table; aws_security_group in our VPC, which depends on the created subnet.
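To make the roles and network steps more concrete, here is a minimal sketch of what roles.tf and vpc.tf could contain. It is not the project's exact code: all resource names, CIDR ranges and the managed policy ARN are my assumptions for illustration. Only the EMR service role is shown; the role for the EC2 instances follows the same pattern of 2 data objects and 2 resources, with ec2.amazonaws.com as the trusted service and its own policy.

```hcl
# --- roles.tf (sketch): the EMR service role ---

# Trust policy that lets the EMR service assume the role.
data "aws_iam_policy_document" "emr_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["elasticmapreduce.amazonaws.com"]
    }
  }
}

# AWS managed policy for the EMR service role (this ARN is an assumption).
data "aws_iam_policy" "emr_service_policy" {
  arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
}

resource "aws_iam_role" "emr_service_role" {
  name               = "tf-big-data-emr-service"
  assume_role_policy = "${data.aws_iam_policy_document.emr_assume_role.json}"
}

resource "aws_iam_role_policy_attachment" "emr_service_attach" {
  role       = "${aws_iam_role.emr_service_role.name}"
  policy_arn = "${data.aws_iam_policy.emr_service_policy.arn}"
}

# --- vpc.tf (sketch): the 6 network and security resources ---

resource "aws_vpc" "emr_vpc" {
  cidr_block           = "10.0.0.0/16" # hypothetical address range
  enable_dns_hostnames = true
}

resource "aws_subnet" "emr_subnet" {
  vpc_id     = "${aws_vpc.emr_vpc.id}"
  cidr_block = "10.0.1.0/24" # hypothetical address range
}

resource "aws_internet_gateway" "emr_gateway" {
  vpc_id = "${aws_vpc.emr_vpc.id}"
}

# Route table with a default route via the internet gateway.
resource "aws_route_table" "emr_route_table" {
  vpc_id = "${aws_vpc.emr_vpc.id}"

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = "${aws_internet_gateway.emr_gateway.id}"
  }
}

# Makes the route table above the main one for the VPC.
resource "aws_main_route_table_association" "emr_main_route" {
  vpc_id         = "${aws_vpc.emr_vpc.id}"
  route_table_id = "${aws_route_table.emr_route_table.id}"
}

# The explicit dependency on the subnet is left out here because the
# depends_on syntax differs between Terraform versions.
resource "aws_security_group" "emr_security_group" {
  name   = "tf-big-data"
  vpc_id = "${aws_vpc.emr_vpc.id}"

  # Allow all outbound traffic.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

The sketch keeps the quoted "${...}" reference style used in provider.tf, which should parse on both older and newer Terraform versions.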
- We also need an aws_iam_instance_profile, which is kept at the end of the emr_cluster.tf module.
- Finally, we can create the EMR cluster itself (emr_cluster.tf). Here we should describe all the required properties, such as: name, release_label, applications, service_role (the role from step 3), log_uri (the S3 bucket from step 2), ec2_attributes (the key pair, network and instance profile from steps 2, 4 and 5), and one or more instance groups. I also added a 'step' section there, where I put a Hive script to execute.
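The last two steps can be sketched as follows. Treat it as an illustration under assumptions, not as the project's exact file: the release label, instance types and counts are placeholders, and the role, subnet and security group names (emr_ec2_role, emr_service_role, emr_subnet, emr_security_group) are hypothetical names for resources assumed to be defined in roles.tf and vpc.tf. The bucket and key pair references use the names from s3_bucket.tf and key_pair.tf. The 'step' section with the Hive script is left out.

```hcl
# emr_cluster.tf (sketch)

# Instance profile that wraps the EC2 role from roles.tf
# (aws_iam_role.emr_ec2_role is a hypothetical name).
resource "aws_iam_instance_profile" "emr_profile" {
  name = "tf-big-data-emr-profile"
  role = "${aws_iam_role.emr_ec2_role.name}"
}

resource "aws_emr_cluster" "emr_cluster" {
  name          = "tf-big-data"
  release_label = "emr-5.16.0" # assumed release
  applications  = ["Hadoop", "Hive"]

  # Role from step 3 and a log folder in the bucket from step 2.
  service_role = "${aws_iam_role.emr_service_role.arn}"
  log_uri      = "s3://${aws_s3_bucket.s3_bucket.bucket}/logs/"

  # Key pair (step 2), network (step 4) and instance profile (step 5).
  ec2_attributes {
    key_name                          = "${aws_key_pair.emr_key_pair.key_name}"
    subnet_id                         = "${aws_subnet.emr_subnet.id}"
    emr_managed_master_security_group = "${aws_security_group.emr_security_group.id}"
    emr_managed_slave_security_group  = "${aws_security_group.emr_security_group.id}"
    instance_profile                  = "${aws_iam_instance_profile.emr_profile.arn}"
  }

  # Instance types and counts are placeholders.
  master_instance_group {
    instance_type = "m4.large"
  }

  core_instance_group {
    instance_type  = "m4.large"
    instance_count = 2
  }
}
```

Recent versions of the AWS provider declare instance groups with master_instance_group and core_instance_group, while older versions used repeated instance_group blocks, so this part may need adjusting to the provider version you run.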
The full code of the project is here.
I would really appreciate any comments or suggestions on how this script could be simplified.
Points of Interest
It's not obvious how many resources are really created behind the scenes when you click the button to create an EMR cluster in the AWS Console. Knowing them helps you understand what is actually happening underneath.