Glue - Official FAQ
The official AWS Glue troubleshooting documentation is linked in the Reference section.
Troubleshooting - Notes & Tips
Notes and tips for Glue when implementing the ETL process:
Naming rules / conventions for AWS services
- S3 bucket names can contain neither uppercase letters nor “_”
- DynamoDB table / Name: can NOT contain “-”
- ES / Name: can NOT contain “-”
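The S3 rule above can be checked before deployment. This is a minimal sketch, not an exhaustive validator: it covers only the basic general-purpose bucket rules (3-63 characters; lowercase letters, digits, dots and hyphens; starts and ends with a letter or digit; no adjacent dots). The function name `is_valid_bucket_name` is hypothetical.

```python
import re

# Basic rules only: 3-63 characters, lowercase letters, digits, dots and
# hyphens, starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def is_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes the basic S3 naming checks above."""
    if not _BUCKET_RE.fullmatch(name):
        return False
    # Two adjacent dots are rejected as well.
    return ".." not in name


if __name__ == "__main__":
    for candidate in ["my-glue-scripts", "My-Glue-Scripts", "my_glue_scripts"]:
        print(candidate, is_valid_bucket_name(candidate))
```

Only the first candidate passes; the other two fail on the uppercase letters and the underscores respectively.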
ACM SSL Cert
Use the us-east-1 region for ACM certificates that are attached to AWS CloudFront.
ElasticSearch Schema
The ES mappings were updated in all environments so that searches on field A are now case-insensitive and work with spaces in A names, e.g.:
    query='{"query":{"terms":{"name":["gift vouchers"]}}}'
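When the query is sent from a script rather than typed into a shell, the same body can be assembled with `json.dumps`, which avoids quoting mistakes around values that contain spaces. A minimal sketch, assuming the field is called `name` as in the example; the helper name `build_terms_query` is hypothetical.

```python
import json
from typing import List


def build_terms_query(field: str, values: List[str]) -> str:
    """Serialize a terms query for `field` matching any of `values`."""
    body = {"query": {"terms": {field: values}}}
    # Compact separators give the whitespace-free form used in the shell variable.
    return json.dumps(body, separators=(",", ":"))


if __name__ == "__main__":
    print(build_terms_query("name", ["gift vouchers"]))
```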
e.g. Fixed the issue for field B, where values with spaces were not returning quotes:
    {
Publish Glue scripts to S3
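Use the AWS S3 include/exclude wildcards to copy only the Glue scripts to the bucket. The $(...) variables assume the command runs as a Makefile recipe; sync is already recursive, so the --recursive flag (which belongs to cp, mv and rm) is dropped:
    aws s3 sync . s3://$(S3_BUCKET)/$(FUNCTION_NAME)_Glue/ --exclude "*" --include "glue_*.py"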
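The order of the filters matters: every file starts out included, and a filter that appears later on the command line takes precedence over an earlier one. The sketch below simulates that rule with Python's `fnmatch`; it is an illustration of the documented behaviour, not the AWS CLI's own code, and the file names and the `select_files` helper are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Tuple

Filter = Tuple[str, str]  # ("include" or "exclude", pattern)


def select_files(paths: Iterable[str], filters: List[Filter]) -> List[str]:
    """Apply filters in order; a later matching filter overrides earlier ones."""
    selected = []
    for path in paths:
        included = True  # every file is included until a filter says otherwise
        for kind, pattern in filters:
            if fnmatchcase(path, pattern):
                included = (kind == "include")
        if included:
            selected.append(path)
    return selected


if __name__ == "__main__":
    files = ["glue_orders.py", "glue_users.py", "helper.py", "Makefile"]
    # Same order as the command: exclude everything, then re-include the scripts.
    print(select_files(files, [("exclude", "*"), ("include", "glue_*.py")]))
    # Reversed order: the trailing exclude wins and nothing is copied.
    print(select_files(files, [("include", "glue_*.py"), ("exclude", "*")]))
```

The first call keeps only the two `glue_*.py` files; the second returns an empty list, which is why `--exclude "*"` has to come before `--include`.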
Glue - Crawler - DependsOn
If the Glue Crawler does not declare DependsOn: Connection, CloudFormation won’t create the Connection resource before the Crawler.
The error message is as follows:
    ===> JDBC Connection not registered: (Service: AWSGlue; Status Code:
Code in the AWS CloudFormation template:
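A minimal sketch of what such a crawler resource can look like, built as a Python dict and printed as CloudFormation JSON. This is not the original template: the logical ID `GlueConnection`, the connection name, the role, the database name and the JDBC path are all hypothetical placeholders.

```python
import json


def crawler_with_dependency(connection_id: str, connection_name: str) -> dict:
    """Build a Glue crawler resource that waits for its connection resource."""
    return {
        "Type": "AWS::Glue::Crawler",
        # Explicit ordering: create the connection before the crawler.
        "DependsOn": connection_id,
        "Properties": {
            "Role": "HypotheticalGlueRole",          # placeholder IAM role
            "DatabaseName": "hypothetical_catalog",  # placeholder catalog database
            "Targets": {
                "JdbcTargets": [
                    # Placeholder JDBC include path.
                    {"ConnectionName": connection_name, "Path": "mydb/%"}
                ]
            },
        },
    }


if __name__ == "__main__":
    resource = crawler_with_dependency("GlueConnection", "hypothetical-jdbc-connection")
    print(json.dumps({"Resources": {"GlueCrawler": resource}}, indent=2))
```

The relevant part is the `DependsOn` key next to `Type` and `Properties`; it names the logical ID of the connection resource, not the connection name used inside `JdbcTargets`.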
ES Timeout issue - using NAT gateway
Background
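ES uses an internet endpoint (not a VPC endpoint).
A VPC endpoint can’t be used for the search service, because it must be accessible from AWS AppSync.
The Glue connection, however, places the job inside the VPC so that it can connect to the DB, so the job also needs a route to the public internet to reach ES.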
AWS Official Doc
All JDBC data stores that are accessed by the job must be available from the VPC subnet.
If your job needs to access both VPC resources and the public internet, the VPC needs to have a Network Address Translation (NAT) gateway inside the VPC.
The network interface is not assigned any public IP addresses. AWS Glue requires internet access (for example, to access AWS services that don’t have VPC endpoints). You can configure a network address translation (NAT) instance inside your VPC, or you can use the Amazon VPC NAT gateway.
Adding Glue Subnet
So a dedicated Glue subnet with outbound internet access (routed through the NAT gateway) is needed.
DB Security Group
The DB security group needs a self-referencing inbound rule; otherwise the following error occurs:
    Error: Inbound Rule in Security Group Required
AWS Doc:
At least one security group must open all ingress ports. To limit traffic, the source security group in your inbound rule can be restricted to the same security group.
Code Example (Security Group)
    ADBSecurityGroup:
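A self-referencing inbound rule can be sketched as a separate ingress resource whose source is the security group itself. The example below builds it as a Python dict that can be dumped as CloudFormation JSON; it is an illustration under assumptions, not the original template. The logical ID `ADBSecurityGroup` is taken from the example above, while the helper name and the ingress resource's own logical ID are hypothetical.

```python
import json


def self_referencing_ingress(sg_logical_id: str) -> dict:
    """Build an ingress rule that opens all TCP ports to the group itself."""
    group_id = {"Fn::GetAtt": [sg_logical_id, "GroupId"]}
    return {
        "Type": "AWS::EC2::SecurityGroupIngress",
        "Properties": {
            "GroupId": group_id,                # the group receiving the rule
            "SourceSecurityGroupId": group_id,  # same group: self-reference
            "IpProtocol": "tcp",
            "FromPort": 0,
            "ToPort": 65535,
        },
    }


if __name__ == "__main__":
    rule = self_referencing_ingress("ADBSecurityGroup")
    print(json.dumps({"ADBSecurityGroupSelfIngress": rule}, indent=2))
```

Declaring the rule as its own resource, rather than inline in the security group, avoids the group having to reference itself inside its own definition.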
Code Example (Glue CloudFormation Template)
    # Glue Job
Reference
- https://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html
- https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html
- https://docs.aws.amazon.com/cli/latest/reference/s3/index.html#use-of-exclude-and-include-filters
- https://docs.aws.amazon.com/glue/latest/dg/glue-troubleshooting-errors.html#error-inbound-self-reference-rule