Nsjail failing - can't use custom Python libraries

We're keen to use custom Python libraries in our workflows, but I've noticed that nsjail (which the functionality depends on) is failing to launch in our containers:

Running code executor container in privileged mode.
iptables: Disallowing link-local addresses
{"level":"info","message":"Not configuring StatsD...","timestamp":"2024-07-13T10:11:42.432Z"}
{"level":"info","message":"Installing request logging middleware...","timestamp":"2024-07-13T10:11:42.438Z"}
{"level":"info","message":"Checking if nsjail can be used","timestamp":"2024-07-13T10:11:42.439Z"}
{"level":"info","message":"Populating environment cache","timestamp":"2024-07-13T10:11:42.442Z"}
{"dd":{"env":"production","service":"code_executor_service","span_id":"7468107029411779173","trace_id":"7468107029411779173"},"level":"info","message":"Done populating","timestamp":"2024-07-13T10:11:42.449Z"}
{"level":"info","message":"Starting code executor on port 3004","timestamp":"2024-07-13T10:11:42.451Z"}
{"dd":{"env":"production","service":"code_executor_service","span_id":"3091449606178479480","trace_id":"678313442480844015"},"level":"error","message":"[E][2024-07-13T10:11:42+0000][24] standaloneMode():275 Couldn't launch the child process","timestamp":"2024-07-13T10:11:42.482Z"}
{"dd":{"env":"production","service":"code_executor_service","span_id":"678313442480844015","trace_id":"678313442480844015"},"level":"info","message":"cleaning up job - /tmp/jobs/83f43ad1-c96b-43f1-9936-dc3c266fb546","timestamp":"2024-07-13T10:11:42.485Z"}
{"dd":{"env":"production","service":"code_executor_service","span_id":"678313442480844015","trace_id":"678313442480844015"},"level":"error","message":"[E][2024-07-13T10:11:42+0000][24] standaloneMode():275 Couldn't launch the child process","timestamp":"2024-07-13T10:11:42.489Z"}
{"level":"info","message":"can use nsjail: false","timestamp":"2024-07-13T10:11:42.490Z"}

We are using the Helm chart (v6.2.5) to deploy Retool into our EKS environment (EC2, not Fargate), and other than this we have been using the platform without issue.

Our nodes are running BOTTLEROCKET_x86_64.

Are there any other settings that we need to be aware of to get this working?


FYI - we are using Retool 3.52.9-stable

@Tess are you or one of the team able to point us in the right direction? We'd love to be able to leverage PyArrow in our workflows.

Hello @bish!

Tess is out of the office today. nsjail is built inside the 'Code Executor' container, so I would check whether the log tools for that container have more details on why it is giving the 'Couldn't launch the child process' error.

Unfortunately I haven't seen this bug before.

Maybe some other users might be able to chime in.

For troubleshooting Helm/EKS/container bugs, the usual debugging method involves going through any and all changes made to the original template cloned from GitHub, to see where the configuration might have diverged from the working default.

This would be much easier to do during our office hours, as we will need to look through Helm charts and logs with lots of back and forth, which is easier to do live :sweat_smile:

Hey @Jack_T, after a bit of debugging I can see that it's related to a setting in Bottlerocket that disables user namespaces by default - nsjail appears to require these to spin up.

In my configuration I was able to enable user namespaces, and nsjail now spins up:
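For reference, the setting in question is (if I have traced it correctly) the `user.max_user_namespaces` sysctl, which Bottlerocket sets to 0 by default. A sketch of the Bottlerocket user data TOML to re-enable it - the limit value here is an arbitrary non-zero choice, so adjust for your environment:

```toml
# Bottlerocket user data: re-enable user namespaces so nsjail can
# create its sandbox. 16384 is an arbitrary non-zero limit.
[settings.kernel.sysctl]
"user.max_user_namespaces" = "16384"
```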

Running code executor container in privileged mode.
iptables: Disallowing link-local addresses
{"level":"info","message":"Not configuring StatsD...","timestamp":"2024-07-25T19:53:16.689Z"}
{"level":"info","message":"Installing request logging middleware...","timestamp":"2024-07-25T19:53:16.708Z"}
{"level":"info","message":"Checking if nsjail can be used","timestamp":"2024-07-25T19:53:16.711Z"}
{"level":"info","message":"Populating environment cache","timestamp":"2024-07-25T19:53:16.714Z"}
{"dd":{"env":"production","service":"code_executor_service","span_id":"3957065433957739717","trace_id":"3957065433957739717"},"level":"info","message":"Done populating","timestamp":"2024-07-25T19:53:16.724Z"}
{"level":"info","message":"Starting code executor on port 3004","timestamp":"2024-07-25T19:53:16.727Z"}
{"dd":{"env":"production","service":"code_executor_service","span_id":"8269636041712581257","trace_id":"8269636041712581257"},"level":"info","message":"cleaning up job - /tmp/jobs/16037338-e69a-4974-bf1f-3148ee27626b","timestamp":"2024-07-25T19:53:16.907Z"}
{"level":"info","message":"can use nsjail: true","timestamp":"2024-07-25T19:53:16.913Z"}

However, I'm still not seeing the ability to add custom python libraries via the 'Modify requirements.txt' in my workflows - am I missing something else?

image

@Jack_T hoping you can also advise on another side effect of code sandboxing - code blocks that rely on boto / AWS resources are now failing because they aren't picking up the container AWS env variables that EKS IRSA utilises for cross-account access.

What is the recommended approach for this?

Hello @bish,

For the library, did clicking on the 'Add python library' in the screenshot you shared not work?

I just had another user tell me they were able to add a JS library just by adding 'require' to the code block. Let me know if adding 'import library' to the top of the block works :crossed_fingers:

If both 'add library' and import don't work, you can use the 'setup script' option by clicking the :gear: at the bottom left, which will give you the option to add code that runs before every Python/JS block.

Where is the 'modify requirements.txt' you are describing?

Will have to double check with our deployments team, I am not familiar with IRSA :thinking:

Hi @Jack_T, thanks for getting back to me. That would be great if you could ask the deployments team about leveraging IRSA in sandboxed code.

With regards to the adding of custom libraries, I'm referring to the instructions here: Execute Python with the Code block | Retool Docs

If you look at the example in the docs compared to my screenshot, I'm missing 'Modify requirements.txt' and I'm also missing that settings icon you also mentioned!

This is what the docs show:
image

This is my UI:
image

No problem!

Looks like the docs team is lagging behind on their updates :sweat_smile: I will let them know they should update those, as well as specify where users on older versions of Retool can find that 'requirements.txt', as it seems we have moved it several times...

What version of Retool are you running? My UI on the latest cloud release also looks different; it seems we are doing a lot of moving components around. Once we get synced up on what version you have, I can better help you find the setup script.

If you click 'Add Python Library' it should automatically update the requirements.txt file, which should run alongside the setup script and be good to go. I'm realizing this feels like overkill, so I will also ask the eng team whether users need to add libs in all of these places or just one :melting_face:

For the deployment question, here is what they are having me ask you:

"What types of Retool Resource(s) are impacted? Did they see this issue after updating their Retool instance?

If so, what were the starting and ending versions in the update?

Did they make any changes to the Resources, to their Retool deployment, or to the relevant AWS services before they started seeing this issue?"

@Jack_T I'm currently on 3.68.0-edge. I can't add any python libs that aren't already pre-installed, which is what this feature is supposed to enable.

With regards to the deployment, the python code blocks in workflows stopped working as soon as I got nsjail working and the code started executing in sandboxed mode.

I guess it's worth pointing out that we use these Python code blocks to talk to AWS APIs and utilise boto quite heavily for this. Before being sandboxed, boto used the credentials provided to the container through IRSA. It looks like the sandboxed code doesn't / can't use these, so those blocks are now failing.

Here's some example code that works in unprivileged mode, but fails in sandboxed:

import boto3
import base64

kms_client = boto3.client('kms')

response = kms_client.encrypt(
    KeyId='alias/keyid',
    Plaintext='Hello World'
)

encrypted_data = response['CiphertextBlob']

print(base64.b64encode(encrypted_data).decode('utf-8'))

When sandboxed, it no longer uses the env variables provided to the container that are necessary for IRSA.

error NoRegionError: You must specify a region. (line 4)
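A quick way to confirm what the sandbox actually sees is to print the variables that the EKS pod identity webhook normally injects for IRSA, plus the region variable boto3 looks up (a diagnostic sketch; these variable names are the standard IRSA/boto3 ones, not anything Retool-specific):

```python
import os

# The EKS pod identity webhook injects AWS_ROLE_ARN and
# AWS_WEB_IDENTITY_TOKEN_FILE when IRSA is configured; boto3 also
# needs a region. Inside the nsjail sandbox these come back as None.
for var in ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE", "AWS_REGION"):
    print(f"{var} = {os.environ.get(var)}")
```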

Here are the env variables provided to the container that are required for IRSA:

Hi @bish!

Very odd issue; it looks like you have the AWS env vars set correctly.

You have probably already tried this, but worth a shot: have you tried restarting your containers?

It sounds like the libraries were working previously, which makes it odd that the code executor isn't able to grab the env vars it needs. Let me ask our team if it is possible to give nsjail the AWS permissions needed to get the env vars.

There is a chance it is not possible and I will look for workarounds as you mentioned you need boto3 to run your processes.

Just heard back from our eng team,

"The sandbox does not get the environment variables set by the parent pod.

Before they used code executor, they used backend execution which would run the code without a sandbox, giving it access to the top-level environment vars.

Code run in nsjail does not get access to all the environment variables. I'd recommend using config vars or hardcoding the values in setup scripts"

And for where to put the env vars:

"In the workflow editor, on the left there is a settings cog named "Settings".

Inside there should be "Python configuration". Inside that are the setup scripts, where they can define code to be run before every code block."
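Based on that answer, a setup-script workaround might look like the following. This is only a sketch: the region value is a placeholder, and whether passing static credentials through config vars is acceptable depends on your security posture.

```python
import os

# Hypothetical setup script: runs before every Python block, so boto3
# can resolve a region inside the sandbox. Static credentials pulled
# from Retool config vars could be wired in the same way if needed.
os.environ.setdefault("AWS_REGION", "eu-west-1")          # placeholder region
os.environ.setdefault("AWS_DEFAULT_REGION", "eu-west-1")  # older boto3 lookups
```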

Hi @Jack_T,

Thanks for reporting back. We still have the issue where I don't see the settings cog or the 'Modify requirements.txt' in our UI, despite the sandbox code execution clearly being enabled - so I wouldn't be able to create setup scripts at the moment.

Whilst I appreciate the workaround, is it possible a more elegant / robust solution could be introduced? IAM roles for service accounts (IRSA) are the primary way of specifying the role and permissions that applications running in a pod's container should assume. Having to manually pass in credentials defeats the point slightly and introduces extra overhead.

Hi @bish,

We just activated two feature flags for your org, so you should now be able to access custom Python libraries. Go to Settings -> Advanced and click the "Check key" button, which will update these features.

I can definitely make a feature request to our eng team for a better interface to add in IRSA to Kubernetes Service Accounts.

Let me know if you have any specific ideas/requests/thoughts on how we could best set this up and make it easy to use and robust for different use cases!

We would definitely appreciate any and all feedback for the feature and it would help narrow down the scope of development :man_technologist:

Hey @Jack_T, thanks for that - I can see the cog and can modify Python libs now.

However, I seem to hit error after error so I may have to call this a day and revert our ETL to AWS for now...

Just to give you some context as to our use case here, we were wanting to utilise pyarrow to load DB data into parquet files in S3.

I was able to get this working by manually installing pyarrow into our containers when we were using the non-sandboxed execution.

With sandboxing and custom libs enabled I can install pyarrow from the workflow UI (which is great). But, it's proving to be quite painful to get working with the env variable limitations and now this new error:

image (error screenshot: s2n_init() error, 'handshake was cancelled')

I'm assuming some sort of networking restrictions inside nsjail? I think we're going to have to admit defeat for now unless there is some obvious reason for the error that your team can see?

Hi @bish I am sorry to hear that.

Were the feature flags able to give you access to the setup scripts (cog -> Settings -> Python configuration) to init the env vars and pass them into the code-executing sandbox?

That would be my best guess for getting the sandbox to sync up with AWS. Given that the s2n_init() error hints that a 'handshake was cancelled', it might be down to the missing env vars, which we can hopefully now drill into the code executor.

Let me check with our eng team on that error; unfortunately I have not seen it before. I found a Stack Overflow post about a similar error here. It might not be the same use case, but it's worth a check.

Hey @Jack_T,

I have literally just this second managed to get our workflow working :partying_face:

I've had to downgrade pyarrow to 11.0.0 in order to get it to work, so something must have changed in the 12.0.0+ versions of the library that conflicts with the sandboxed environment. I wonder if your engineering team might be able to spot anything obvious in the 12.0.0 changelog: Apache Arrow 12.0.0 Release | Apache Arrow

We'll stick with 11.0.0 for now though.
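For anyone else landing here, the pin via 'Modify requirements.txt' in the workflow UI is just:

```
pyarrow==11.0.0
```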

Thanks for your support in helping us get there!

@bish Amazing news!!!

Glad you were able to get the workflow running :tada: :sweat_smile: :fire:

Will double check with the eng team on the pyarrow versioning to see if that is something we can fix or at least warn other users about for the short term.

Happy coding!
