7 Things I Found Annoying About AWS Glue

Ness Phan
5 min read · May 10, 2020

1. Extremely slow start times


The number one most annoying thing about Glue is that the startup times are extremely slow. It can take up to 20 minutes to start a Glue job (a little less if you have run it recently), and that is not counting the time it takes to actually run the job. Compare that to the startup time of GCP’s Dataproc, which typically takes around 60–90 seconds. This means that debugging a Glue job can often be a long, arduous process in which half of your time is spent just waiting for the job to start running. It also means that if you were proud of getting your Spark job down to a speedy 5 minutes, it could actually end up taking a total of 25 minutes to run.

Check it out — A 10 minute startup time for a 1 minute long job :|
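
For what it’s worth, you can see the overhead for yourself in the job run metadata. Here is a rough boto3 sketch (the job name and run ID are made up) that compares the wall-clock time of a run against the ExecutionTime Glue reports, which roughly covers only the time the job actually spent running:

import boto3

# Rough sketch: total wall-clock time of a run minus the reported
# ExecutionTime approximates how long you waited for startup.
# "my-glue-job" and the run ID are placeholders.
glue = boto3.client("glue")

run = glue.get_job_run(JobName="my-glue-job", RunId="jr_0123456789abcdef")["JobRun"]

total_seconds = (run["CompletedOn"] - run["StartedOn"]).total_seconds()
overhead_seconds = total_seconds - run["ExecutionTime"]  # ExecutionTime is in seconds

print(f"total: {total_seconds:.0f}s, executing: {run['ExecutionTime']}s, "
      f"waiting: {overhead_seconds:.0f}s")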

2. No obvious way to launch in a VPC

Another thing I found annoying: if you want to launch your Glue job in a VPC so it can talk to an EC2 instance you have also launched in that VPC, there is no obvious way to do it. You can’t simply provide a subnet as a Glue job argument when you launch the job. Instead, you have to create a fake JDBC AWS Glue Connection (by fake I mean you can specify anything for the JDBC URL, because you will not actually be using the connection in your Glue job) and then launch your Glue job with the Glue Connection attached to it. I went over how to do this in my article here.

You can create any garbage JDBC URL string
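
If it helps, here is roughly what that workaround looks like in boto3. This is just a sketch based on the steps above; the connection name, subnet, security group, availability zone, role, and script location are all placeholders. The JDBC properties are throwaway values, and only the physical connection requirements matter:

import boto3

glue = boto3.client("glue")

# A "fake" JDBC connection whose only job is to carry the VPC networking
# details. The URL is never used, so any garbage string will do.
glue.create_connection(
    ConnectionInput={
        "Name": "fake-vpc-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://garbage:3306/garbage",
            "USERNAME": "unused",
            "PASSWORD": "unused",
        },
        # This is the part Glue actually uses to place the job in your VPC.
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)

# Attach the connection to the job so its runs get launched in the VPC.
glue.create_job(
    Name="job-in-vpc",
    Role="MyGlueServiceRole",
    Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/scripts/job.py"},
    Connections={"Connections": ["fake-vpc-connection"]},
)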

3. Sparse Documentation

My issue with number two was also a result of this next annoyance: sparse documentation. The VPC issue would not have been so bad had it been documented, but the only way I was able to figure it out was to contact AWS Support. Because Glue is not very popular, Stack Overflow usually did not have the solution to most problems either. Even simple things like the CLI documentation for Glue seemed to have gotten less TLC than the CLI documentation for other AWS managed services (like AWS Elasticsearch), which often include examples of CLI usage and output, something the Glue documentation does not have.

Other AWS services had rich documentation such as examples of CLI usage and output, whereas AWS Glue did not.

4. No ability to name jobs

The inability to name job runs was also a large annoyance, since it made it difficult to distinguish between two runs of a Glue job. Glue only identifies runs by a Run ID, which looks like this in the GUI:

Incredibly not obvious which dataset is failing here

But imagine you have a Glue job that loads different S3 datasets into Redshift, and you need to launch a run for each S3 dataset. Just by looking at the GUI, it is hard to tell which Glue run belongs to which dataset. If a run fails, the only way to figure out which dataset it belonged to is to click into the Logs or Error Logs and ctrl+F for your dataset.

Small things like this don’t seem to be a big deal initially, but in the long run they slow down development time due to the extra clicking and also create friction in debugging and analyzing job runs.
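
One way to take some of the sting out of this (just a sketch, with a made-up job name and a hypothetical --DATASET argument) is to pass the dataset name as a job argument when you start each run, so you can map the opaque Run IDs back to datasets through the API instead of ctrl+F-ing through logs:

import boto3

glue = boto3.client("glue")

# Start one run per dataset and tag each run with a hypothetical --DATASET
# argument. The job name and dataset names are placeholders.
datasets = ["orders", "customers", "clicks"]
for dataset in datasets:
    glue.start_job_run(
        JobName="load-s3-to-redshift",
        Arguments={"--DATASET": dataset},
    )

# Later: map each Run ID back to its dataset from the run's arguments,
# instead of opening the logs for every failed run.
for run in glue.get_job_runs(JobName="load-s3-to-redshift")["JobRuns"]:
    print(run["Id"], run["JobRunState"], run.get("Arguments", {}).get("--DATASET"))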

5. Bad UI

Here’s a screenshot of the bad UI:

Nice.

Here are reasons why the above screenshot is annoying:

  • The job runs window pops out at the bottom and only expands to, at most, half the screen
  • The job run information is all squished together horizontally, with no ability to widen each column to see more of the information (you can hover to see more, but wouldn’t it be easier if all the job run information were displayed without having to hover over each column and row?)
  • Speaking of hovering, let’s just hope that when you do hover, this doesn’t happen:
An example of me hovering over the “Job run input” column (last column). Check out how the hovered information is cut off.

6. Glue job failing with “Resource Unavailable”

Sometimes a job will take incredibly long to start up, only to fail with “Resource Unavailable”. This is an internal AWS error and occurs when AWS does not have enough resources on their end to run your Glue job in the region. The only way to solve it is to wait and try again in a couple of hours, or maybe the next day, in hopes that it will work. There are also some other obscure and non-obvious ways to get around this issue, detailed in this Stack Overflow link. But it seems to be an ongoing AWS issue that they have yet to fix.

What does “after sometime” mean?
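
If “wait and try again” is the official advice, about the only thing you can do is automate the waiting. Here is a minimal retry sketch; the job name, polling interval, retry delay, and the exact error-message text it matches on are all assumptions:

import time

import boto3

glue = boto3.client("glue")

def run_with_retries(job_name, attempts=3, wait_between_attempts=3600):
    """Start a Glue job run, poll until it finishes, and re-submit if it
    died with the internal "Resource unavailable" error."""
    for _ in range(attempts):
        run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
        while True:
            run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
            if run["JobRunState"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
                break
            time.sleep(60)  # poll once a minute
        if run["JobRunState"] == "SUCCEEDED":
            return run_id
        if "resource unavailable" not in run.get("ErrorMessage", "").lower():
            raise RuntimeError(f"{job_name} failed: {run.get('ErrorMessage')}")
        time.sleep(wait_between_attempts)  # wait and hope capacity frees up
    raise RuntimeError(f"{job_name} kept failing with 'Resource Unavailable'")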

7. Sometimes you need to set the “--conf” flag even though the docs say NOT to set it.

Your Glue job might have failed with this error before:

Container killed by YARN for exceeding memory limits. 5.7 GB of 5.5 GB physical memory used.

I had this issue several times, and the way I was able to fix it was to increase the memory as detailed here. The fix involves setting the “--conf” flag, which the official Glue documentation says not to set.

Taken from the official Glue documentation

However, AWS support will tell you to correct the issue by increasing the memory via the “--conf” flag.

AWS team recommending to set the --conf flag (source)
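
For reference, here is a sketch of what that looks like when starting a run from boto3. The job name is a placeholder, and the specific Spark property and value are only illustrative; which setting you bump (and by how much) depends on your job:

import boto3

glue = boto3.client("glue")

# Pass "--conf" as a job argument, even though the docs tell you it is
# reserved. The property and value below are illustrative placeholders
# (memoryOverhead is specified in MB here).
glue.start_job_run(
    JobName="load-s3-to-redshift",
    Arguments={"--conf": "spark.yarn.executor.memoryOverhead=2048"},
)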

Conclusion

While I was developing with Glue, I ran into many small problems that all contributed to a painful development process and friction in debugging and usage. These ranged from small user-experience annoyances, to contradictory documentation and advice, to large holes in the documentation on how to do basic things. The service would be fine as part of non-critical workflows or as a solution to small problems, but I would say proceed with caution if you want to use it in critical, production services.
