Note
You will need to set up a separate cloud account in order to run Vorpal in the cloud. These instructions are for the Amazon Web Services cloud provider, which has convenient tools for setting up an HPC cluster.
Running Vorpal on a cluster in the Amazon cloud, or AWS (Amazon Web Services), is now possible. These instructions show how to set up a user account with the correct permissions, create a storage area, download the ParallelCluster tools, and configure and start a cluster.
These instructions are focused on the minimum needed to get the user up and running on AWS. For more detailed instructions about HPC clusters, please see:
In the cloud, operations are done with web services, so starting a cluster involves interacting with several AWS services. The full architecture diagram is shown in Fig. 94 below. In these instructions we will refer to various services, and you will need to interact with the AWS website to use these services to set up the cluster.
Before one can run Vorpal on the AWS cloud, at least two accounts have to be set up. The root account manages information for the organization (sets up credit card payments, adds other users, etc.), so this is the first account. The second account is for the user that runs Vorpal and will need special permissions in order to create a cluster.
To create a root account, follow these steps:
We need to create two policies, two groups associated with these policies, and a user that is a member of both groups. To do this, follow these steps:
Now we create two policies:
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"ec2:DescribeKeyPairs",
"ec2:DescribeRegions",
"ec2:DescribeVpcs",
"ec2:DescribeSubnets",
"ec2:DescribeSecurityGroups",
"ec2:DescribePlacementGroups",
"ec2:DescribeImages",
"ec2:DescribeInstances",
"ec2:DescribeInstanceStatus",
"ec2:DescribeSnapshots",
"ec2:DescribeVolumes",
"ec2:DescribeVpcAttribute",
"ec2:DescribeAddresses",
"ec2:CreateTags",
"ec2:DescribeNetworkInterfaces",
"ec2:DescribeAvailabilityZones"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "EC2Describe"
},
{
"Action": [
"ec2:CreateVolume",
"ec2:RunInstances",
"ec2:AllocateAddress",
"ec2:AssociateAddress",
"ec2:AttachNetworkInterface",
"ec2:AuthorizeSecurityGroupEgress",
"ec2:AuthorizeSecurityGroupIngress",
"ec2:CreateNetworkInterface",
"ec2:CreateSecurityGroup",
"ec2:ModifyVolumeAttribute",
"ec2:ModifyNetworkInterfaceAttribute",
"ec2:DeleteNetworkInterface",
"ec2:DeleteVolume",
"ec2:TerminateInstances",
"ec2:DeleteSecurityGroup",
"ec2:DisassociateAddress",
"ec2:RevokeSecurityGroupIngress",
"ec2:RevokeSecurityGroupEgress",
"ec2:ReleaseAddress",
"ec2:CreatePlacementGroup",
"ec2:DeletePlacementGroup"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "EC2Modify"
},
{
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "AutoScalingDescribe"
},
{
"Action": [
"autoscaling:CreateAutoScalingGroup",
"ec2:CreateLaunchTemplate",
"ec2:ModifyLaunchTemplate",
"ec2:DeleteLaunchTemplate",
"ec2:DescribeLaunchTemplates",
"ec2:DescribeLaunchTemplateVersions",
"autoscaling:PutNotificationConfiguration",
"autoscaling:UpdateAutoScalingGroup",
"autoscaling:PutScalingPolicy",
"autoscaling:DescribeScalingActivities",
"autoscaling:DeleteAutoScalingGroup",
"autoscaling:DeletePolicy",
"autoscaling:DisableMetricsCollection",
"autoscaling:EnableMetricsCollection"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "AutoScalingModify"
},
{
"Action": [
"dynamodb:DescribeTable"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "DynamoDBDescribe"
},
{
"Action": [
"dynamodb:CreateTable",
"dynamodb:DeleteTable",
"dynamodb:TagResource"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "DynamoDBModify"
},
{
"Action": [
"sqs:GetQueueAttributes"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "SQSDescribe"
},
{
"Action": [
"sqs:CreateQueue",
"sqs:SetQueueAttributes",
"sqs:DeleteQueue",
"sqs:TagQueue"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "SQSModify"
},
{
"Action": [
"sns:ListTopics",
"sns:GetTopicAttributes"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "SNSDescribe"
},
{
"Action": [
"sns:CreateTopic",
"sns:Subscribe",
"sns:DeleteTopic"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "SNSModify"
},
{
"Action": [
"cloudformation:DescribeStackEvents",
"cloudformation:DescribeStackResource",
"cloudformation:DescribeStackResources",
"cloudformation:DescribeStacks",
"cloudformation:ListStacks",
"cloudformation:GetTemplate"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "CloudFormationDescribe"
},
{
"Action": [
"cloudformation:CreateStack",
"cloudformation:DeleteStack",
"cloudformation:UpdateStack"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "CloudFormationModify"
},
{
"Action": [
"s3:Get*",
"s3:List*"
],
"Resource": "arn:aws:s3:::<REGIION>-aws-parallelcluster*",
"Effect": "Allow",
"Sid": "S3ParallelClusterReadWrite"
},
{
"Action": [
"iam:PassRole",
"iam:CreateRole",
"iam:DeleteRole",
"iam:GetRole",
"iam:TagRole",
"iam:SimulatePrincipalPolicy"
],
"Resource": "arn:aws:iam::<ACCOUNT_NUMBER>:role/vsim_cluster_role",
"Effect": "Allow",
"Sid": "IAMModify"
},
{
"Action": [
"iam:CreateInstanceProfile",
"iam:DeleteInstanceProfile"
],
"Resource": "arn:aws:iam::<ACCOUNT_NUMBER>:instance-profile/*",
"Effect": "Allow",
"Sid": "IAMCreateInstanceProfile"
},
{
"Action": [
"iam:AddRoleToInstanceProfile",
"iam:RemoveRoleFromInstanceProfile",
"iam:GetRolePolicy",
"iam:GetPolicy",
"iam:AttachRolePolicy",
"iam:DetachRolePolicy",
"iam:PutRolePolicy",
"iam:DeleteRolePolicy"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "IAMInstanceProfile"
},
{
"Action": [
"efs:DescribeMountTargets",
"efs:DescribeMountTargetSecurityGroups",
"ec2:DescribeNetworkInterfaceAttribute"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "EFSDescribe"
},
{
"Action": [
"ssm:GetParametersByPath"
],
"Resource": "*",
"Effect": "Allow",
"Sid": "SSMDescribe"
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"ec2:DescribeVolumes",
"ec2:AttachVolume",
"ec2:DescribeInstanceAttribute",
"ec2:DescribeInstanceStatus",
"ec2:DescribeInstances",
"ec2:DescribeRegions"
],
"Resource": [
"*"
],
"Effect": "Allow",
"Sid": "EC2"
},
{
"Action": [
"dynamodb:ListTables"
],
"Resource": [
"*"
],
"Effect": "Allow",
"Sid": "DynamoDBList"
},
{
"Action": [
"sqs:SendMessage",
"sqs:ReceiveMessage",
"sqs:ChangeMessageVisibility",
"sqs:DeleteMessage",
"sqs:GetQueueUrl"
],
"Resource": [
"arn:aws:sqs:<REGION>:<ACCOUNT_NUMBER>:parallelcluster-*"
],
"Effect": "Allow",
"Sid": "SQSQueue"
},
{
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:TerminateInstanceInAutoScalingGroup",
"autoscaling:SetDesiredCapacity",
"autoScaling:UpdateAutoScalingGroup",
"autoscaling:DescribeTags",
"autoScaling:SetInstanceHealth"
],
"Resource": [
"*"
],
"Effect": "Allow",
"Sid": "Autoscaling"
},
{
"Action": [
"cloudformation:DescribeStacks",
"cloudformation:DescribeStackResource"
],
"Resource": [
"arn:aws:cloudformation:<REGION>:<ACCOUNT_NUMBER>:stack/parallelcluster-*/*"
],
"Effect": "Allow",
"Sid": "CloudFormation"
},
{
"Action": [
"dynamodb:PutItem",
"dynamodb:Query",
"dynamodb:GetItem",
"dynamodb:DeleteItem",
"dynamodb:DescribeTable"
],
"Resource": [
"arn:aws:dynamodb:<REGION>:<ACCOUNT_NUMBER>:table/parallelcluster-*"
],
"Effect": "Allow",
"Sid": "DynamoDBTable"
},
{
"Action": [
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::<REGION>-aws-parallelcluster/*"
],
"Effect": "Allow",
"Sid": "S3GetObj"
},
{
"Action": [
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::<REGION>-aws-parallelcluster/*"
],
"Effect": "Allow",
"Sid": "S3PutObj"
},
{
"Action": [
"sqs:ListQueues"
],
"Resource": [
"*"
],
"Effect": "Allow",
"Sid": "SQSList"
},
{
"Action": [
"iam:PassRole",
"iam:CreateRole"
],
"Resource": [
"arn:aws:iam::<ACCOUNT_NUMBER>:role/parallelcluster-*"
],
"Effect": "Allow",
"Sid": "BatchJobPassRole"
}
]
}
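If you prefer scripting this step over using the IAM console, the same two policy documents can also be created with the AWS CLI once it is configured with credentials that have IAM permissions. This is only a sketch; the policy names and local file names below are placeholders you can change:
$ aws iam create-policy --policy-name VSimClusterUserPolicy --policy-document file://vsim-cluster-user-policy.json
$ aws iam create-policy --policy-name VSimClusterInstancePolicy --policy-document file://vsim-cluster-instance-policy.json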
Now we create two groups for the two policies:
Next, we add a role that the EC2 service can use to access the AWS services:
We then add the user:
$ mkdir ~/.aws
$ touch ~/.aws/config
$ chmod 600 ~/.aws/config
Then edit this “config” file. Copy the following contents into it, replacing <ACCESS_KEY_ID> and <SECRET_ACCESS_KEY> with the values shown in the browser and <REGION> with the region you are using.
[default]
aws_access_key_id=<ACCESS_KEY_ID>
aws_secret_access_key=<SECRET_ACCESS_KEY>
region=<REGION>
This file will be used later to contact the various Amazon Web Services needed to create a cluster.
You can also download a comma-separated-values (.csv) file with the user’s credentials to save the access key information.
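Once the config file is in place, you can check that the credentials work by asking AWS who you are. This assumes the AWS CLI is installed (for example with pip3 install awscli --user); it uses the credentials in ~/.aws/config:
$ aws sts get-caller-identity   # should print the user ARN and account number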
These instructions take you through setting up an S3 storage bucket and uploading VSim to it.
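If you would rather use the command line than the S3 console for this, the equivalent steps look roughly like the following; the bucket name is a placeholder and must be globally unique, and the file names should match the VSim tarball and license file you downloaded:
$ aws s3 mb s3://<BUCKET_NAME>
$ aws s3 cp VSim-11.0.0-Linux64.tar.gz s3://<BUCKET_NAME>/
$ aws s3 cp vsim11.txlic s3://<BUCKET_NAME>/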
Short-term storage is used only while the cluster instance is operating. Files are synced back and forth to the long-term storage on cluster creation and deletion. There are two basic options for this storage: (1) EFS, an NFS-type system; and (2) FSx, a Lustre high-performance system. Option 2 is more expensive but more performant; you can create both and move back and forth between clusters using one or the other. We give instructions for using EFS first.
At this point we need to upload a public key. Usually this is in the ~/.ssh directory. On Mac, one cannot navigate to that directory via the web browser, so copy the public key to a directory that is accessible, such as Documents:
$ cp ~/.ssh/id_rsa.pub ~/Documents
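As an alternative to importing the key through the EC2 console, the key pair can be imported from the command line. This is a sketch that assumes the AWS CLI configured above and uses the key name expected by the cluster config later in these instructions:
$ aws ec2 import-key-pair --key-name vsimcluster --public-key-material file://~/.ssh/id_rsa.pub   # with AWS CLI version 2, use fileb:// instead of file://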
These instructions describe how to download a Python 3 based tool from AWS needed to launch a cluster from the command line:
Download and install Python 3 if you don’t already have it. On Mac, this can be done with Homebrew:
$ python --version
$ brew install python3 # if not version 3.7
Ensure that python3 is first in your PATH, then install the cluster tools. For example, on Mac you would do the following:
$ which pip3 # should return a full path to python3 distribution
$ pip3 install aws-parallelcluster --upgrade --user
$ export PATH=~/Library/Python/3.7/bin:$PATH
$ which pcluster # should return a full path
On Linux, Python puts pip-installed packages in ~/.local/bin, which is likely already in your PATH, so there is no need for the export line.
After pcluster is installed, create the configuration directory and file (~/.parallelcluster/config) with the following contents, replacing <VPC_ID> and <SUBNET_ID> with the VPC and subnet IDs. The VPC ID can be found on the EC2 Dashboard by clicking the Default VPC link in the upper right of the page. In the VPC Dashboard, click the Subnets link in the left navigation bar; you should see a subnet ID for each Availability Zone, and choosing the first one is fine. (These IDs can also be looked up from the command line, as sketched after the config listing below.)
[aws]
aws_region_name = <REGION>
[cluster default]
key_name = vsimcluster
initial_queue_size = 0
max_queue_size = 16
min_queue_size = 0
placement_group = DYNAMIC
cluster_type = spot
base_os = centos7
scheduler = slurm
master_instance_type = c4.4xlarge
compute_instance_type = c4.4xlarge
vpc_settings = vsimvpc
ebs_settings = vsimebs
s3_read_write_resource = arn:aws:s3:::<BUCKET_NAME>*
[vpc vsimvpc]
vpc_id = vpc-<VPC_ID>
master_subnet_id = subnet-<SUBNET_ID>
[ebs vsimebs]
shared_dir = /vsim
volume_type = gp2
volume_size = 100
[global]
cluster_template = default
update_check = true
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
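For reference, the VPC and subnet IDs mentioned above can also be listed from the command line; this is just a convenience and assumes the AWS CLI configured earlier:
$ aws ec2 describe-vpcs --query "Vpcs[*].[VpcId,IsDefault]" --output text
$ aws ec2 describe-subnets --query "Subnets[*].[SubnetId,AvailabilityZone]" --output text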
At this point, you should be set up to start an AWS Cloud cluster with the ParallelCluster tool.
$ export PATH=~/Library/Python/3.7/bin:$PATH
$ pcluster create mycluster
This will take several minutes. You can watch for the Master node being created under the Running Instances link on the EC2 Dashboard. Also, on your first startup you may receive a message saying that your “request for accessing resources in this region is being validated.”
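While the stack is being built, you can also follow its progress from the command line. These subcommands are from ParallelCluster 2.x, which matches the create syntax used above:
$ pcluster status mycluster   # reports the CloudFormation stack status
$ pcluster list               # lists all clusters for this account and region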
When the prompt returns, the login information will be printed out. Use that login information (account and public IP) to log in, with a command like:
ssh ec2-user@100.20.239.156
You should not need a password because you will be using the private/public key pair that you imported above. Once you are logged into the cluster, you can look around with commands like the following:
$ ls /
$ which sbatch
$ sinfo
You should see the /vsim file system and that the scheduler is present and working. You can try a simple slurm batch script with the following contents:
#!/bin/sh
#SBATCH --job-name=test
#SBATCH --output=test.out
#SBATCH --error=test.err
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --time=00:01:00
/bin/hostname
srun -l /bin/hostname
srun -l which mpirun
Save this script as test.slm and submit it with:
$ sbatch test.slm
You can view its progress with the squeue command:
$ squeue
You can check that compute nodes have started by going back to the EC2 Running Instances page in the browser. When the job has completed, you should see the standard output (test.out) and standard error (test.err) files. The error file should be empty, and the output file should contain hostnames and paths to MPI.
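For example, a quick way to inspect the results from the Master node:
$ cat test.err   # should be empty
$ cat test.out   # hostnames from each task plus the path to mpirun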
If the cluster is working well and you are confident in how to run a slurm batch job, then you can set up VSim on the cluster. We do this by syncing the S3 storage with the EFS share:
$ aws s3 sync s3://<BUCKET_NAME> /vsim
$ cd /vsim
$ tar xzvf VSim-11.0.0-Linux64.tar.gz
$ cp vsim11.txlic VSim-11.0/licenses/license.txt
And then to test running VSim, copy an example directory and translate the SDF file with:
$ cd /vsim
$ mkdir empulse/
$ cp VSim-11.0/Contents/examples/VSimBase/VisualSetup/emPulseInVacuum/* empulse/
$ source VSim-11.0/VSimComposer.sh
$ cd empulse
$ sdf2vpre -s emPulseInVacuum.sdf -p emPulseInVacuum.pre
Here is a slurm batch script to run a job on the cluster with this input file:
#!/bin/bash
#SBATCH --job-name=empulse
#SBATCH --output=empulse.out
#SBATCH --error=empulse.err
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks=4
#SBATCH --time=00:10:00
source /vsim/VSim-11.0/VSimComposer.sh
srun --hint=nomultithread --ntasks-per-node=2 vorpal -i emPulseInVacuum.pre -n 100 -d 50
If the file containing the above is named empulse.slm, then this job is submitted with:
sbatch empulse.slm
You can check on your job with
squeue
and you can stop the job with
scancel JOBID
where JOBID is the job id returned by squeue.
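Before deleting the cluster, you may want to copy any simulation results from the shared volume back to long-term storage. A minimal sketch, run on the Master node, assuming you want to keep the results in the same bucket used earlier:
$ aws s3 sync /vsim s3://<BUCKET_NAME>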
When you are through running your simulations on the cluster, it should be shut down with the following commands:
$ export PATH=~/Library/Python/3.7/bin:$PATH
$ pcluster delete mycluster
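Deletion also takes a few minutes while the CloudFormation stack is torn down. Once it finishes, the cluster should no longer appear in the list (again, ParallelCluster 2.x syntax):
$ pcluster list   # mycluster should be gone when deletion completes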