AWS Glue is a serverless data integration service that allows you to process and integrate data coming through different data sources at scale. AWS Glue 5.0, the latest version of AWS Glue for Apache Spark jobs, provides a performance-optimized Apache Spark 3.5 runtime for batch and stream processing. With AWS Glue 5.0, you get improved performance, enhanced security, support for the next generation of Amazon SageMaker, and more. AWS Glue 5.0 enables you to develop, run, and scale your data integration workloads and get insights faster.
AWS Glue accommodates various development preferences through multiple job creation approaches. For developers who prefer direct coding, Python or Scala development is available using the AWS Glue ETL library.
Building production-ready data platforms requires robust development processes and continuous integration and delivery (CI/CD) pipelines. To support diverse development needs, whether on local machines, in Docker containers on Amazon Elastic Compute Cloud (Amazon EC2), or in other environments, AWS provides an official AWS Glue Docker image through the Amazon ECR Public Gallery. The image allows developers to work efficiently in their preferred environment while using the AWS Glue ETL library.
In this post, we show how to develop and test AWS Glue 5.0 jobs locally using a Docker container. This post is an updated version of the post Develop and test AWS Glue version 3.0 and 4.0 jobs locally using a Docker container, and uses AWS Glue 5.0.
Available Docker images
The following Docker images are available in the Amazon ECR Public Gallery:
- AWS Glue version 5.0 – public.ecr.aws/glue/aws-glue-libs:5

AWS Glue Docker images are compatible with both x86_64 and arm64.
In this post, we use public.ecr.aws/glue/aws-glue-libs:5 and run the container on a local machine (Mac, Windows, or Linux). This container image has been tested for AWS Glue 5.0 Spark jobs. The image contains the following:
To set up your container, you pull the image from the ECR Public Gallery and then run the container. We demonstrate how to run your container with the following methods, depending on your requirements:
- spark-submit
- REPL shell (pyspark)
- pytest
- Visual Studio Code
Prerequisites
Before you start, make sure that Docker is installed and the Docker daemon is running. For installation instructions, see the Docker documentation for Mac, Windows, or Linux. Also make sure that you have at least 7 GB of disk space for the image on the host running Docker.
Configure AWS credentials
To enable AWS API calls from the container, set up your AWS credentials with the following steps:
- Create an AWS named profile.
- Open cmd on Windows or a terminal on Mac/Linux, and run the following command:
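For example, you can export the profile name as a shell variable that the docker run commands in later sections reference (the variable name PROFILE_NAME is an assumption here):

```bash
# Replace "default" with your AWS named profile; the PROFILE_NAME
# variable is reused by the docker run examples that follow.
PROFILE_NAME="default"
```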
In the following sections, we use this AWS named profile.
Pull the image from the ECR Public Gallery
If you're running Docker on Windows, choose the Docker icon (right-click) and choose Switch to Linux containers before pulling the image.
Run the following command to pull the image from the ECR Public Gallery:
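```bash
# Pull the AWS Glue 5.0 image from the Amazon ECR Public Gallery.
docker pull public.ecr.aws/glue/aws-glue-libs:5
```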
Run the container
Now you can run a container using this image. You can choose any of the following methods based on your requirements.
spark-submit
You can run an AWS Glue job script by running the spark-submit command on the container.
Write your job script (sample.py in the following example) and save it under the /local_path_to_workspace/src/ directory using the following commands:
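The following is a sketch of those commands; the variable names and the use of vim are assumptions, and any editor works:

```bash
# Define the workspace location and the script file name; both variables
# are referenced by the docker run command that follows.
export WORKSPACE_LOCATION=/local_path_to_workspace
export SCRIPT_FILE_NAME=sample.py
mkdir -p ${WORKSPACE_LOCATION}/src
vim ${WORKSPACE_LOCATION}/src/${SCRIPT_FILE_NAME}
```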
These variables are used in the following docker run command. The sample code (sample.py) used in the spark-submit command is included in the appendix at the end of this post.
Run the following command to run the spark-submit command on the container to submit a new Spark application:
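A minimal sketch of that command, assuming the PROFILE_NAME and WORKSPACE_LOCATION variables from the earlier steps (the container name is arbitrary):

```bash
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v ${WORKSPACE_LOCATION}:/home/hadoop/workspace/ \
    -e AWS_PROFILE=${PROFILE_NAME} \
    --name glue5_spark_submit \
    public.ecr.aws/glue/aws-glue-libs:5 \
    spark-submit /home/hadoop/workspace/src/${SCRIPT_FILE_NAME}
```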
REPL shell (pyspark)
You can run a REPL (read-eval-print loop) shell for interactive development. Run the following command to run the pyspark command on the container to start the REPL shell:
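A sketch under the same assumptions as the spark-submit example:

```bash
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -e AWS_PROFILE=${PROFILE_NAME} \
    --name glue5_pyspark \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark
```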
You will see the following output:
With this REPL shell, you can code and test interactively.
pytest
For unit testing, you can use pytest for AWS Glue Spark job scripts.
Run the following commands for preparation:
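A sketch of the preparation, assuming the unit test file sits next to the job script in the mounted workspace (the test script itself is in Appendix A):

```bash
export WORKSPACE_LOCATION=/local_path_to_workspace
export SCRIPT_FILE_NAME=sample.py
export UNIT_TEST_FILE_NAME=test_sample.py
mkdir -p ${WORKSPACE_LOCATION}/src
vim ${WORKSPACE_LOCATION}/src/${UNIT_TEST_FILE_NAME}
```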
Now let's invoke pytest using docker run:
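A sketch of that invocation; passing -c to the image runs a shell command inside the container, and both that behavior and the --disable-warnings flag are assumptions here:

```bash
docker run -i --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v ${WORKSPACE_LOCATION}:/home/hadoop/workspace/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=${PROFILE_NAME} \
    --name glue5_pytest \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c "python3 -m pytest --disable-warnings"
```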
When pytest finishes executing the unit tests, your output will look something like the following:
Visual Studio Code
To set up the container with Visual Studio Code, complete the following steps:
- Install Visual Studio Code.
- Install Python.
- Install Dev Containers.
- Open the workspace folder in Visual Studio Code.
- Press Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (Mac).
- Enter Preferences: Open Workspace Settings (JSON).
- Press Enter.
- Enter the following JSON and save it:
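A sketch of the workspace settings, assuming the Python 3.11 interpreter and Spark installation paths used by the AWS Glue 5.0 image; verify the exact paths inside your container:

```json
{
    "python.defaultInterpreterPath": "/usr/bin/python3.11",
    "python.analysis.extraPaths": [
        "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip",
        "/usr/lib/spark/python/"
    ]
}
```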
Now you're ready to set up the container.
- Run the Docker container (you can reuse the pyspark command shown in the REPL shell section):
- Start Visual Studio Code.
- Choose Remote Explorer in the navigation pane.
- Choose the container public.ecr.aws/glue/aws-glue-libs:5 (right-click) and choose Attach in Current Window.
- If the following dialog appears, choose Got it.
- Open /home/hadoop/workspace/.
- Create an AWS Glue PySpark script and choose Run.
You should see the successful run of the AWS Glue PySpark script.
Changes between the AWS Glue 4.0 and AWS Glue 5.0 Docker images
The following are the major changes between the AWS Glue 4.0 and AWS Glue 5.0 Docker images:
- In AWS Glue 5.0, there is a single container image for both batch and streaming jobs. This differs from AWS Glue 4.0, where there was one image for batch and another for streaming.
- In AWS Glue 5.0, the default user name of the container is hadoop. In AWS Glue 4.0, the default user name was glue_user.
- In AWS Glue 5.0, several additional libraries, including JupyterLab and Livy, have been removed from the image. You can install them manually.
- In AWS Glue 5.0, the Iceberg, Hudi, and Delta Lake libraries are all preloaded by default, and the environment variable DATALAKE_FORMATS is no longer needed. Up to AWS Glue 4.0, the environment variable DATALAKE_FORMATS was used to specify which table format to load.

The preceding list is specific to the Docker image. To learn more about the AWS Glue 5.0 updates, see Introducing AWS Glue 5.0 for Apache Spark and Migrating AWS Glue for Spark jobs to AWS Glue version 5.0.
Considerations
Keep in mind that the following features are not supported when using the AWS Glue container image to develop job scripts locally:
Conclusion
In this post, we explored how the AWS Glue 5.0 Docker images provide a flexible foundation for developing and testing AWS Glue job scripts in your preferred environment. These images, readily available in the Amazon ECR Public Gallery, streamline the development process by offering a consistent, portable environment for AWS Glue development.
To learn more about how to build an end-to-end development pipeline, see End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue. We encourage you to explore these capabilities and share your experiences with the AWS community.
Appendix A: AWS Glue job sample codes for testing
This appendix introduces three different scripts as AWS Glue job sample codes for testing purposes. You can use any of them in the tutorial.
The following sample.py code uses the AWS Glue ETL library with an Amazon Simple Storage Service (Amazon S3) API call. The code requires Amazon S3 permissions in AWS Identity and Access Management (IAM). You need to grant the IAM managed policy arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess or an IAM custom policy that allows you to make ListBucket and GetObject API calls for the S3 path.
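A minimal sketch of such a script; the class name, the helper function, and the public sample dataset path are assumptions, so adapt them to your own data:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext


def read_json(glue_context, path):
    # Read JSON files from Amazon S3 into a DynamicFrame.
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [path], "recurse": True},
        format="json",
    )


class GluePythonSampleTest:
    def __init__(self):
        params = []
        if "--JOB_NAME" in sys.argv:
            params.append("JOB_NAME")
        args = getResolvedOptions(sys.argv, params)

        self.context = GlueContext(SparkContext.getOrCreate())
        self.job = Job(self.context)
        self.job.init(args.get("JOB_NAME", "test"), args)

    def run(self):
        # us-legislators is a public AWS sample dataset; reading it needs
        # only the S3 read permissions described above.
        dyf = read_json(
            self.context,
            "s3://awsglue-datasets/examples/us-legislators/all/persons.json",
        )
        dyf.printSchema()
        self.job.commit()


if __name__ == "__main__":
    GluePythonSampleTest().run()
```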
The following test_sample.py code is a sample for a unit test of sample.py:
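A sketch of the test, assuming sample.py exposes the read_json helper from the previous listing and that pytest discovers the file in the mounted workspace:

```python
import pytest
from awsglue.context import GlueContext
from pyspark.context import SparkContext

import sample  # the job script from the previous listing


@pytest.fixture(scope="module", autouse=True)
def glue_context():
    # Share a single GlueContext across the tests in this module.
    sc = SparkContext.getOrCreate()
    yield GlueContext(sc)
    sc.stop()


def test_read_json(glue_context):
    dyf = sample.read_json(
        glue_context,
        "s3://awsglue-datasets/examples/us-legislators/all/persons.json",
    )
    # The public sample dataset is non-empty.
    assert dyf.toDF().count() > 0
```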
Appendix B: Adding JDBC drivers and Java libraries
To add a JDBC driver not currently available in the container, you can create a new directory under your workspace with the JAR files you need and mount the directory to /opt/spark/jars/ in the docker run command. JAR files found under /opt/spark/jars/ within the container are automatically added to the Spark classpath and will be available for use during the job run.
For example, you can use the following docker run command to add JDBC driver JARs to a PySpark REPL shell:
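A sketch following the directory layout described above; the jars/ subdirectory under the workspace and the container name are assumptions:

```bash
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v ${WORKSPACE_LOCATION}/jars/:/opt/spark/jars/ \
    -e AWS_PROFILE=${PROFILE_NAME} \
    --name glue5_jdbc_pyspark \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark
```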
As highlighted earlier, the customJdbcDriverS3Path connection option can't be used to import a custom JDBC driver from Amazon S3 in AWS Glue container images.
Appendix C: Adding Livy and JupyterLab
The AWS Glue 5.0 container image doesn't have Livy installed by default. You can create a new container image that extends the AWS Glue 5.0 container image as the base. The following Dockerfile demonstrates how you can extend the Docker image to include additional components you need to enhance your development and testing experience.
To get started, create a directory on your workstation and place the Dockerfile.livy_jupyter file in the directory:
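For example (the directory name is arbitrary):

```bash
mkdir -p glue5_livy_jupyter && cd glue5_livy_jupyter
```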
The following code is Dockerfile.livy_jupyter:
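A minimal sketch of such a Dockerfile, shown here with JupyterLab only; installing and configuring Livy from the Apache downloads would follow the same pattern, and the install command is an assumption:

```dockerfile
# Extend the AWS Glue 5.0 image with JupyterLab.
FROM public.ecr.aws/glue/aws-glue-libs:5

USER root
RUN pip3 install --no-cache-dir jupyterlab
USER hadoop
```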
Run the docker build command to build the image:
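For example (the image tag is arbitrary):

```bash
docker build -t glue5-livy-jupyter -f Dockerfile.livy_jupyter .
```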
When the image build is complete, you can use the following docker run command to start the newly built image:
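A sketch of that command, assuming the image tag from the build step, JupyterLab's default port 8888, and that the image's entrypoint accepts a shell command via -c (as in the pytest example):

```bash
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v ${WORKSPACE_LOCATION}:/home/hadoop/workspace/ \
    -e AWS_PROFILE=${PROFILE_NAME} \
    -p 8888:8888 \
    --name glue5_livy_jupyter \
    glue5-livy-jupyter \
    -c "jupyter lab --ip=0.0.0.0 --no-browser"
```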
Appendix D: Adding extra Python libraries
In this section, we discuss adding extra Python libraries and installing Python packages using pip.
Local Python libraries
To add local Python libraries, place them under a directory and assign the path to $EXTRA_PYTHON_PACKAGE_LOCATION:
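A sketch under stated assumptions: the mount point inside the container is arbitrary, and the -c form appends it to PYTHONPATH before starting pyspark so the image's own PYTHONPATH entries are preserved:

```bash
export EXTRA_PYTHON_PACKAGE_LOCATION=/local_path_to_python_packages
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v ${EXTRA_PYTHON_PACKAGE_LOCATION}:/home/hadoop/workspace/extra_python_path/ \
    -e AWS_PROFILE=${PROFILE_NAME} \
    --name glue5_pylib \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c 'export PYTHONPATH=/home/hadoop/workspace/extra_python_path/:$PYTHONPATH; pyspark'
```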
To validate that the path has been added to PYTHONPATH, you can check for its existence in sys.path:
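For example, in the pyspark REPL (assuming the mount point from the previous command):

```python
>>> import sys
>>> [p for p in sys.path if "extra_python_path" in p]
['/home/hadoop/workspace/extra_python_path/']
```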
Installing Python packages using pip
To install packages from PyPI (or any other artifact repository) using pip, you can use the following approach:
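A sketch under the same entrypoint assumption as earlier; the package name is a placeholder:

```bash
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -e AWS_PROFILE=${PROFILE_NAME} \
    --name glue5_pip \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c 'pip3 install --user snowflake-connector-python; pyspark'
```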
About the Authors
Subramanya Vajiraya is a Sr. Cloud Engineer (ETL) at AWS Sydney specializing in AWS Glue. He is passionate about helping customers solve issues related to their ETL workloads and implementing scalable data processing and analytics pipelines on AWS. Outside of work, he enjoys going on bike rides and taking long walks with his dog Ollie.
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team, based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.