---
title: UNL STAT850, Fall 2017
summary: "UNL STAT850, Fall 2017."
---
**When:** 1:00pm - 1:45pm
---
title: UNL STAT950, Fall 2017
summary: "UNL STAT950, Fall 2017"
---
**Date:** 9:00 am - 10:15 am, 09/19/2017
---
title: UNL STAT Alpha Seminar, Fall 2017
summary: "UNL STAT Alpha Seminar, Fall 2017."
---
**Date:** 11:00 am, 09/18/2017
---
title: "2018"
summary: "Listing of various HCC events for the year 2018."
---
Historical listing of HCC Events
----------
{{ children('Events/2018') }}
---
title: UNL STAT850, Fall 2018
summary: "UNL STAT850, Fall 2018."
---
**When:** 12:30pm - 1:45pm
---
title: GP-ENGINE NRP and K8s Tutorial Logistics
summary: Links and Guides for GPN Annual Meeting Workshop
weight: 1
---
## Overview
This workshop will give researchers a hands-on introduction to leveraging the large distributed computing resources available through the [National Research Platform (NRP)]({{nrp.base_url}}). Researchers will learn how to migrate an AI/ML workflow from their local computer or Swan to the NRP. Through Nautilus, you gain access to a vast array of cutting-edge GPUs, ideal for accelerating AI/ML research.
**When:** June 3rd, 10:15 am to 12:15 pm
**Learning Objectives:**
- Kubernetes Architecture and Concepts
- Hands-on with Nautilus HyperCluster
- Deploying Pods and Jobs in Kubernetes using Nautilus
- Deploying Scikit Learn ML Jobs to Kubernetes
- Deploying GPU Jobs to Kubernetes for Training Computer Vision Models
- Scaling Computer Vision Models on Kubernetes with Job Automation
These sessions will be delivered in an **in-person** format:
- **In-Person:** [260 Scheman Building](https://maps.app.goo.gl/FJB9E35Ta5N9BY969)
## Setup
{% include "../../static/markdown/events/in-person_reactions_05-2025.md" %}
### Sign in and complete form
Sign into NRP: [https://portal.nrp.ai/](https://portal.nrp.ai/)
When you are signed in, you should see your email at the top right of your screen.
Similar to:
![](https://hcc.unl.edu/docs/images/nrp-get-config.png)
Please enter your email in this form so we can give you access to the resources for the tutorial today:
???+ note "[https://forms.office.com/r/G8Gk0qM0Td](https://forms.office.com/r/G8Gk0qM0Td)"
    <iframe width="640px" height="480px" src="https://forms.office.com/Pages/ResponsePage.aspx?id=rQHb_YNJbkOrNRrwQ7gYyTcgmp-IrS1MoiDlYnwuDJlURExPRTFWTk4wR0hJRk5PMEkyUTlEWUgxOCQlQCN0PWcu&embed=true" frameborder="0" marginwidth="0" marginheight="0" style="border: none; max-width:100%; max-height:100vh" allowfullscreen webkitallowfullscreen mozallowfullscreen msallowfullscreen> </iframe>
{% include "../../static/markdown/events/in-person_expectations_05-2025.md" %}
## Links and Steps
### During the Workshop
#### Open these at the start:
- [National Research Platform (NRP) Homepage]({{nrp.base_url}})
- [NRP Documentation]({{nrp.docs_url}})
- [GitHub Repository for Today](https://github.com/MUAMLL/gp-engine-tutorials)
#### When we move to the hands-on activities
1. Sign into the GP-ENGINE JupyterHub Portal: https://gp-engine.nrp-nautilus.io/
2. Select `Stack Datascience + K8s` and click `Start`. This process will take a few minutes.
3. Put up your yellow sticky note when you see it is complete.
??? tip "Want to learn how to set up your own instance for a class or lab?"
    Check out the "Deploying JupyterHub" documentation: [NRP Jupyter Hub Docs]({{nrp.docs_url}}/userdocs/jupyter/jupyterhub/)
Cloning the repository:
```terminal
git clone https://github.com/MUAMLL/gp-engine-tutorials.git
```
### More Information
- [Getting Started on NRP]({{nrp.docs_url}}/userdocs/start/getting-started/)
- [Scientific Images on NRP]({{nrp.docs_url}}/userdocs/running/sci-img/)
- [NRP GitLab](https://gitlab.nrp-nautilus.io/)
- [NRP GitLab K8s Integration]({{nrp.docs_url}}/userdocs/development/k8s-integration/)
- [Globus on NRP]({{nrp.docs_url}}/userdocs/running/globus-connect/#_top)
- [NRP Jupyter Hub Docs]({{nrp.docs_url}}/userdocs/jupyter/jupyterhub/)
- [GitHub Repository for Today](https://github.com/MUAMLL/gp-engine-tutorials)
- [GP-ENGINE](https://gp-engine.org)
---
title: Event Info Pages for 2025
---
---
title: June Workshop Series 2025 Logistics
summary: Links and Guides for JWS 2025
weight: 2
---
## Setup
{% include "../../static/markdown/events/ood_setup_hybrid_05-2025.md" %}
## General Information and Links
{% include "../../static/markdown/events/hybrid_reactions_05-2025.md" %}
{% include "../../static/markdown/events/hybrid_expectations_05-2025.md" %}
### Links
Workshop Website: https://hcc.unl.edu/june-workshop-series-2025
Feedback Form: https://forms.office.com/r/GK1wAvhQTn
=== "Week 1 - June 5th"
    **Introduction to Bash**
    Command History Link: https://hcc.unl.edu/swc-history/20250605.html
    Lesson Link: https://swcarpentry.github.io/shell-novice/
    Slides Link: [June 5th Slides](https://uofnelincoln.sharepoint.com/:p:/s/UNL-HollandComputingCenter/ETOmr-ouWtpOiOZDsYJyv18BVIAN0jdI776QBwCdFC1LAQ?e=hUs5H9)
=== "Week 2 - June 12th"
    **Introduction to Bash (cont.) and Introduction to HCC**
    Bash Command History Link: https://hcc.unl.edu/swc-history/20250612.bash.html
    HPC Intro Command History Link: https://hcc.unl.edu/swc-history/20250612.html
    Slides Link: [June 12th Slides](https://uofnelincoln.sharepoint.com/:p:/s/UNL-HollandComputingCenter/EZnPMiJjsuhOnltax40z3coBfjvpXkaeyCH-Ge-94WZVwA?e=xMjU6t&nav=eyJzSWQiOjI2MywiY0lkIjoxMzEzMzYwNTUzfQ)
=== "Week 3 - June 19th"
    **Introduction to HCC (cont.)**
    Command History Link: https://hcc.unl.edu/swc-history/20250619.html
    Slides Link: [June 19th Slides](https://uofnelincoln.sharepoint.com/:p:/s/UNL-HollandComputingCenter/EZnPMiJjsuhOnltax40z3coBfjvpXkaeyCH-Ge-94WZVwA?e=Wb3HRk&nav=eyJzSWQiOjQxNSwiY0lkIjo4MzgxNDM2NH0)
    If you would like to pre-submit questions for next week, please fill out this form: https://forms.office.com/r/eUyfF4YH52
=== "Week 4 - June 26th"
    **Using Software on HCC and Panel**
    Command History Link: https://hcc.unl.edu/swc-history/20250626.html
    Slides Link: [June 26th Slides](https://uofnelincoln.sharepoint.com/:p:/s/UNL-HollandComputingCenter/EZnPMiJjsuhOnltax40z3coBfjvpXkaeyCH-Ge-94WZVwA?e=i4pLV4&nav=eyJzSWQiOjQwOCwiY0lkIjoxOTY4NTI1MTYxfQ)
    If you would like to pre-submit questions, please fill out this form: https://forms.office.com/r/eUyfF4YH52
---
title: General Workshop Materials
description: General Setup and Materials for various HCC workshops
---
## What is the purpose of the sticky notes?
During hands-on sessions, sticky notes are used to signal your status to staff: a yellow note means you have completed the current step, and a red note means you need assistance.
## Troubleshooting Logging into Swan
1. Sign into [Swan OpenOnDemand](https://swan-ood.unl.edu) in your web browser: [https://swan-ood.unl.edu](https://swan-ood.unl.edu)
If you are having issues signing in:
=== "Incorrect Credentials / Password"
    To reset your password, please use the MyHCC portal at https://hcc.unl.edu/myhcc and select "I forgot my password"
=== "DUO is not sending a notification"
    If you set up a new mobile device since creating your HCC account, you will likely need to reactivate your DUO Mobile App. During a live event, please put up a "Red" sticky note and a staff member will assist. Have your phone ready with the DUO Mobile App opened.
=== "DUO shows that my account is disabled"
    This means your account either has not finished its setup yet or was locked due to inactivity on your HCC account. During a live event, please put up a "Red" sticky note and a staff member will assist. Have your phone ready with the DUO Mobile App opened.
2. Open the training specific webpage. This will have been shared with you in an email, as a part of your class, and is also available on HCC's events page at https://hcc.unl.edu/upcoming-events.
## Training Specific information
=== "Kickstart"
    General info for HCC Kickstarts
=== "June Workshop Series"
    General info for HCC JWS
=== "Classroom Tutorial"
    General info for Classroom Introductions
---
title: Events
summary: "Historical listing of various HCC events."
hidden: true
---
Historical listing of HCC Events
----------
{{ children('Events') }}
---
title: HCC Class Info for Instructors
summary: "Information for instructors teaching class using HCC resources"
---
The Holland Computing Center (HCC) provides high-performance computing resources for universities and colleges in the state of Nebraska. HCC supports research, creative activity, and class activity.
This guide provides a list of useful links for classes utilizing HCC resources.
## Useful links for instructors
- Class group creation form, https://hcc.unl.edu/new-group-request
- Class group renewal form, https://hcc.unl.edu/class-renewal-request
- Link to HCC class policy page, https://hcc.unl.edu/hcc-policies#class-groups
## Useful notes for instructors
- Each student request to join the class HCC group needs to be approved by the instructor. To ease this process, you can send us a roster of the class so we can create the accounts without further approval.
- Some classes may utilize custom conda environments. Some conda environments, especially GPU-based ones, **can be very large and can easily exceed the available $HOME quota** of 20 GB. To avoid this, **we recommend creating a group-level custom conda environment**. If you are interested in this, please email hcc-support@unl.edu for further information.
- If data needs to be shared among all class members, please see https://hcc.unl.edu/docs/handling_data/swan_data_sharing/#using-group-level-shared-directory for best practices.
- Some instructors utilize the **command-line to access Swan** (https://hcc.unl.edu/docs/connecting/terminal/), while others utilize the **HCC OnDemand web portal** (https://hcc.unl.edu/docs/open_ondemand/). The HCC OnDemand portal provides a graphical interface to Swan and includes many graphical applications such as JupyterLab, RStudio, Virtual Desktop, etc.
- We provide many software tools and packages as system-wide modules on Swan, https://hcc.unl.edu/docs/applications/modules/available_software_for_swan/. If you need a Linux software package that is not currently available, you can request that it be installed system-wide using the Software Install Request Form, https://hcc.unl.edu/software-installation-request.
- You can request an in-class HCC training/presentation, https://hcc.unl.edu/docs/faq/#can-hcc-provide-training-for-my-group.
- Creating and setting HCC accounts takes some time. **To motivate students to create HCC accounts before the training/assignments are due, some instructors make this process a graded assignment**.
- We strongly recommend sending a Canvas announcement to your students with various useful links and information, especially if an in-class HCC training/presentation is scheduled. You can use the template below, replacing `COURSE_GROUP` and the dates at the end. Please feel free to modify the template as needed.
```
We will be utilizing the Holland Computing Center (HCC) for our course this semester and will be having a hands-on tutorial during the first few weeks of the semester.
To use these resources, you will need an account associated with HCC that is separate from your University of Nebraska account and credentials.
Below are links to get access to HCC resources specific for this class:
- If you already have an HCC account from a prior class or research, you can add the class group to your account or move your account to the COURSE_GROUP group: https://hcc.unl.edu/group-addchange-request.
- If you need an HCC account: https://hcc.unl.edu/new-user-request.
For more useful links and notes for students taking class using HCC resources, please see https://hcc.unl.edu/docs/faq/class_students.
While using HCC resources for this class, you will need to be aware of HCC's class account guidelines and center's policies: https://hcc.unl.edu/hcc-policies.
Pay special attention to the guidelines for class accounts since data is removed at the end of the semester: https://hcc.unl.edu/hcc-policies#class-groups.
**You will need to activate your DUO two factor for your HCC account in order to use the resources. Without this you will not have access!**
This will need to be completed by <INSERT DATE HERE> since we will have staff from HCC providing a tutorial on <INSERT DATE HERE>.
```
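The group-level or custom-location conda environments recommended above can be created with conda's `--prefix` option. A minimal sketch, assuming Swan's `anaconda` module and a `$WORK` file system; the path and package list are placeholders:

```shell
# Load conda (module name assumed), then create the environment
# under $WORK instead of the 20 GB $HOME quota.
module load anaconda
conda create --yes --prefix $WORK/conda_envs/class-env python=3.11
# Activate the environment by path rather than by name.
conda activate $WORK/conda_envs/class-env
```

Environments created with `--prefix` are referenced by their full path, so every group member needs to use the same path when activating.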
---
title: HCC Class Info for Students
summary: "Information for students taking class using HCC resources"
---
The Holland Computing Center (HCC) provides high-performance computing resources for universities and colleges in the state of Nebraska. HCC supports research, creative activity, and class activity.
This guide provides a list of useful links for classes utilizing HCC resources.
## Useful links for students
### Account creation and setup
- New user account form, https://hcc.unl.edu/new-user-request/
- If you already have an existing HCC account, use the Add/Change group request form instead, https://hcc.unl.edu/group-addchange-request
- How to activate HCC Duo, https://hcc.unl.edu/docs/accounts/setting_up_and_using_duo/
- How to reset your HCC password, https://hcc.unl.edu/docs/accounts/how_to_change_your_password/
- What are good HCC practices, https://hcc.unl.edu/docs/good_hcc_practices/
### Application links
- HCC Open OnDemand (OOD) web access to Swan, https://swan-ood.unl.edu/
- Detailed HCC Open OnDemand guide, https://hcc.unl.edu/docs/open_ondemand/
- Creating custom CPU/GPU conda environments, https://hcc.unl.edu/docs/applications/user_software/using_anaconda_package_manager/#creating-custom-anaconda-environments
- Various SLURM examples of applications that run on Swan, https://github.com/unlhcc/job-examples/
- How to share data across group members, https://hcc.unl.edu/docs/handling_data/swan_data_sharing/
### Training
- HCC info sheet, https://go.unl.edu/hcc_info_sheet
- HCC SLURM scheduler cheat sheet, https://go.unl.edu/hcc_slurm_cheatsheet
- Conda user cheat sheet, https://know.continuum.io/rs/387-XNW-688/images/conda-cheatsheet.pdf
- YouTube videos of previous HCC training events, https://www.youtube.com/@hollandcomputingcenter8962
- Upcoming HCC events, https://hcc.unl.edu/upcoming-events
### Support
- For issues related to classwork or assignments, **please contact your TA**.
- For general HCC issues, **please first check the FAQ page**, https://hcc.unl.edu/docs/faq/.
- For issues with HCC resources (e.g., unavailable resources, sudden errors and job failures), contact HCC support at hcc-support@unl.edu. **Please note that non-critical tickets/issues will be addressed during business hours (Monday through Friday, 9am-5pm CST).** Critical issues include, but are not limited to, a sudden inability to log in and file-system unresponsiveness for all HCC users.
## Useful notes for students
- The **login node** is shared among all HCC users and **should be used only for light tasks**, not for running computationally intensive ones. For any CPU- or memory-intensive operations, such as testing and running applications, use an interactive session (https://hcc.unl.edu/docs/submitting_jobs/creating_an_interactive_job/) or submit a job to the batch queue (https://hcc.unl.edu/docs/submitting_jobs/).
- If the Open OnDemand resources are not enough, you can run JupyterLab Notebooks within a SLURM job, https://hcc.unl.edu/docs/applications/submitting_jupyter_code/running_jupyter_lab_code.
- Custom conda environments are created in $HOME by default. **You can change the location of the conda environment**, https://hcc.unl.edu/docs/applications/user_software/using_anaconda_package_manager/#using-common-for-environments/.
- Custom conda environments can easily fill the 20 GB quota on $HOME. **To free up space, you can remove unused Anaconda packages and caches**, https://hcc.unl.edu/docs/applications/user_software/using_anaconda_package_manager/#remove-unused-anaconda-packages-and-caches.
- Open OnDemand cannot launch and will show an error **if your $HOME quota is exceeded**.
- If you are not sure if the storage utilization is causing issues for you, you can run `ncdu` (https://hcc.unl.edu/docs/faq/#how-can-i-check-which-directories-utilize-the-most-storage-on-swan) from the Swan terminal with the location in question.
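Alongside `ncdu`, the standard `du` utility gives a quick, scriptable view of where your storage is going. A sketch; point it at the directory in question:

```shell
# Size of each top-level directory under $HOME, smallest to largest.
du -h --max-depth=1 "$HOME" | sort -h
# Total size of a single directory (conda caches are a common culprit).
du -sh "$HOME/.conda"
```

`ncdu` remains the more convenient choice for interactively drilling into large trees.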
---
title: Common Retirement
summary: "Common Retirement and FAQ"
---
- [Summary](#summary)
- [Where can I move my data?](#where-can-i-move-my-data)
- [When is the data stored on Common going away?](#when-is-the-data-stored-on-common-going-away)
- [How do I get access to NRDStor?](#how-do-i-get-access-to-nrdstor)
- [How can I migrate my data from Common to NRDstor or Attic?](#how-can-i-migrate-my-data-from-common-to-nrdstor-or-attic)
- [I use $COMMON in my submit scripts, what should I do now?](#i-use-common-in-my-submit-scripts-what-should-i-do-now)
- [Will my Globus Collection on a directory under $COMMON continue to be active?](#will-my-globus-collection-on-a-directory-under-common-continue-to-be-active)
- [Our group has shared directory on $COMMON, what should we do?](#our-group-has-shared-directory-on-common-what-should-we-do)
- [How can I move my conda environment from $COMMON?](#how-can-i-move-my-conda-environment-from-common)
- [Can I run jobs from $NRDSTOR?](#can-i-run-jobs-from-nrdstor)
---
{% include "static/markdown/common-power-off.md" %}
## Summary
HCC will be retiring the $COMMON filesystem on July 1st, 2025 and powering it off shortly after.
**Important Dates:**
- Common retired on July 1st, 2025
#### Why are my jobs not starting due to licenses?
As part of finalizing the retirement of $COMMON, the Slurm licenses for $COMMON were paused, preventing any jobs from utilizing the file system after July 1st. This allows any jobs that were still using $COMMON on June 30th to finish running before $COMMON is retired.
Any job which cannot finish before the retirement is scheduled to begin will pend and show this message. If you are sure your job can finish in time, you can lower the requested time to less than the interval before the retirement; use this with care, however, to ensure your job isn't prematurely terminated.
When $COMMON is fully powered off, any submit scripts containing `--licenses=common` will fail to submit, as the license won't exist anymore.
### Where can I move my data?
You can move your data to [NRDStor](../handling_data/data_storage/NRDSTOR/) or [Attic](../handling_data/data_storage/using_attic/). In any case, please make sure you have your important data backed up in another location. For large data transfers, it is strongly encouraged to use [Globus](../handling_data/data_transfer/globus_connect/). The Globus transfer servers for Swan and Attic provide a faster connection and verify the data on both ends of the transfer. NRDStor is accessible from the Swan Globus Collection using the `/mnt/nrdstor/` path or via the `hcc#nrdstor` Globus endpoint.
The retirement of Common also provides a good opportunity for researchers to evaluate and archive data that is no longer being used.
### When is the data stored on Common going away?
Data on Common will be removed shortly after the retirement on July 1st, 2025.
### How do I get access to NRDStor?
NRDStor on Swan is accessible for all HCC users by default.
You can access your NRDStor directory using the $NRDSTOR environment variable (i.e., `cd $NRDSTOR`).
On the other hand, accessing [NRDStor](/handling_data/data_storage/NRDSTOR/) locally through the CIFS/SMB protocol requires separate access. The steps to get access are:
1. Complete the ["Introduction to NRDStor" course](https://nebraska.bridgeapp.com/learner/courses/21f78713/enroll).
2. [Request NRDStor Access](https://hcc.unl.edu/request-nrdstor-access)
3. Wait for the HCC group owner to approve the request.
Once the HCC group owner has approved the request, HCC will grant access to NRDStor. After access has been granted, please review the [NRDStor Documentation.](../handling_data/data_storage/NRDSTOR/)
!!! warning
    Accessing NRDStor locally through the CIFS/SMB protocol requires NU VPN and is only available for researchers from the University of Nebraska system.
### How can I migrate my data from Common to NRDstor or Attic?
!!! warning "Data from members of your group"
    You can only reliably migrate data which your HCC account owns.
    By default, there are no per-group accounts that can access and migrate all group members' data.
    **Attempting a migration of other members' data will very likely result in migration errors and incomplete transfers.**
- The first step will be to [log into Globus](https://app.globus.org/) using your University credentials or your Globus account.
- [Activate the hcc#swan Globus collection](https://hcc.unl.edu/docs/handling_data/data_transfer/globus_connect/activating_hcc_cluster_endpoints/) and sign in using your HCC account.
- Activate the collection for your destination, `hcc#nrdstor` or `hcc#attic` for NRDstor and Attic respectively.
- [In the "File Manager" tab in Globus](https://hcc.unl.edu/docs/handling_data/data_transfer/globus_connect/file_transfers_between_endpoints/), select the files and directories on the $COMMON side that you wish to transfer to NRDStor or Attic.
- Click on the "Start" button above your files on NRDStor to initiate the copy.
- The transfer will automatically start in the background across HCC's transfer servers.
We also have a brief overview of the use of Globus from a prior workshop available [here](https://youtu.be/7zhjr35ellI?t=1643).
## Jobs and Workflows
### I use $COMMON in my submit scripts, what should I do now?
All your submit scripts should be modified to use the directory where your moved data now resides instead of $COMMON. Additionally, the previously required line `#SBATCH --licenses=common` should be removed from your submit scripts as well.
### How can I make sure that my jobs and workflows do not use $COMMON going forward?
The use of $COMMON requires the Slurm `common` license to be passed to the job. If you remove `#SBATCH --licenses=common` from your submit script, any job that tries to use $COMMON will fail because the path will not exist within the job and result in a missing file/directory error.
This is a reliable way to prevent $COMMON from being used by active workflows.
Additionally, review any scripts for mentions of "/common" and "$COMMON".
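One way to perform that review with standard tools is a recursive `grep`. A sketch; `~/jobs` is a hypothetical location, so adjust it to wherever your submit scripts live:

```shell
# List every line in *.sh files that still references $COMMON,
# /common, or the common Slurm license directive.
grep -rn --include='*.sh' -e '\$COMMON' -e '/common' -e 'licenses=common' ~/jobs
```

An empty result means no script under that path still mentions $COMMON.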
### Can I run jobs from $NRDSTOR?
Yes, you can submit jobs from the $NRDSTOR file system. Please note that the $WORK file system is recommended for running jobs, as it is a high-performance file system.
## Data Sharing
### Will my Globus Collection on a directory under $COMMON continue to be active?
No, when $COMMON is retired the storage media will be immediately erased, and any access to data in $COMMON will be lost. Before the retirement of $COMMON, please move your data elsewhere and create a new Globus Collection following the steps [here](https://hcc.unl.edu/docs/handling_data/data_transfer/globus_connect/file_sharing/).
### Our group has shared directory on $COMMON, what should we do?
Please contact hcc-support@unl.edu to have this directory moved to $WORK or $NRDSTOR.
## Conda Environments
### How can I move my conda environment from $COMMON?
For instructions on moving and recreating an existing conda environment, please see [here](https://hcc.unl.edu/docs/applications/user_software/using_anaconda_package_manager/#moving-and-recreating-existing-environment).
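The linked procedure amounts to exporting the old environment's package list and rebuilding it at the new location. A minimal sketch, with placeholder paths:

```shell
# Capture the package list of the environment under $COMMON...
conda env export --prefix $COMMON/conda_envs/myenv > myenv.yml
# ...and recreate it at the new location, e.g. under $WORK.
conda env create --prefix $WORK/conda_envs/myenv --file myenv.yml
```

Recreating from an export, rather than copying the directory, avoids the broken hard-coded paths that conda bakes into an environment.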
### I am trying to recreate my environment in another location and I am getting the error message of "Killed".
This message is caused by the login node on Swan terminating long-running or computationally intensive processes.
Some conda environments are very large and require many packages and dependencies; as conda works to find a compatible set of packages, environment creation can be a very time-consuming process.
As a result, the login node cleans up those processes to keep the node performant.
To get around this, please create the environment from an [interactive job](/submitting_jobs/creating_an_interactive_job/).
This will allow you to create the environment from a worker node.
!!! note "Example Command"
    ```console
    srun --mem=8gb --nodes=1 --ntasks-per-node=4 --time=2:00:00 --pty $SHELL
    ```
---
title: FAQ
summary: "HCC Frequently Asked Questions"
weight: 10
---
## Table of Contents
**Account Management**
- [I have an account, now what?](#i-have-an-account-now-what)
- [How can I change or reset my password?](#how-can-i-change-or-reset-my-password)
- [How do I (re)activate Duo?](#how-do-i-reactivate-duo)
- [I want to create HCC account, but when I try to request one, I am getting the error "Your account email must match the email on record for this group". What should I do?](#i-want-to-create-hcc-account-but-when-i-try-to-request-one-i-am-getting-the-error-your-account-email-must-match-the-email-on-record-for-this-group-what-should-i-do)
- [I want to change my primary group or add an additional group to my HCC account](#i-want-to-change-my-primary-group-or-add-an-additional-group-to-my-hcc-account)
- [My account has been locked and I would like to gain access to it.](#my-account-has-been-locked-and-i-would-like-to-gain-access-to-it)
**Data Storage**
- [I just deleted some files and didn't mean to! Can I get them back?](#i-just-deleted-some-files-and-didnt-mean-to-can-i-get-them-back)
- [How can I check which directories utilize the most storage on Swan?](#how-can-i-check-which-directories-utilize-the-most-storage-on-swan)
- [I want to compress large directory with many files. How can I do that?](#i-want-to-compress-large-directory-with-many-files-how-can-i-do-that)
- [I want to share data with others on Swan. How can I do that?](#i-want-to-share-data-with-others-on-swan-how-can-i-do-that)
**Job Submission**
- [How many nodes/memory/time should I request?](#how-many-nodesmemorytime-should-i-request)
- [I am trying to run a job but nothing happens?](#i-am-trying-to-run-a-job-but-nothing-happens)
- [I keep getting the error "slurmstepd: error: Exceeded step memory limit at some point." What does this mean and how do I fix it?](#i-keep-getting-the-error-slurmstepd-error-exceeded-step-memory-limit-at-some-point-what-does-this-mean-and-how-do-i-fix-it)
- [I keep getting the error "Some of your processes may have been killed by the cgroup out-of-memory handler." What does this mean and how do I fix it?](#i-keep-getting-the-error-some-of-your-processes-may-have-been-killed-by-the-cgroup-out-of-memory-handler-what-does-this-mean-and-how-do-i-fix-it)
- [I keep getting the error "Job cancelled due to time limit." What does this mean and how do I fix it?](#i-keep-getting-the-error-job-cancelled-due-to-time-limit-what-does-this-mean-and-how-do-i-fix-it)
- [My submitted job takes long time waiting in the queue or it is not running?](#my-submitted-job-takes-long-time-waiting-in-the-queue-or-it-is-not-running)
- [Why my job is showing (ReqNodeNotAvail, Reserved for maintenance) before a downtime?](#why-my-job-is-showing-reqnodenotavail-reserved-for-maintenance-before-a-downtime)
- [My job is submitted to the highmem partition and is pending with QOSMinMemory reason. What does this mean?](#my-job-is-submitted-to-the-highmem-partition-and-is-pending-with-qosminmemory-reason-what-does-this-mean)
**Open OnDemand**
- [Why my Open OnDemand JupyterLab or Interactive App Session is stuck and is not starting?](#why-my-open-ondemand-jupyterlab-or-interactive-app-session-is-stuck-and-is-not-starting)
- [My directories are not full, but my Open OnDemand JupyterLab Session is still not starting. What else should I try?](#my-directories-are-not-full-but-my-open-ondemand-jupyterlab-session-is-still-not-starting-what-else-should-i-try)
- [Why my Open OnDemand RStudio Session is crashing?](#why-my-open-ondemand-rstudio-session-is-crashing)
- [I need more resources than I can select with Open OnDemand Apps, can I do that?](#i-need-more-resources-than-i-can-select-with-open-ondemand-apps-can-i-do-that)
**Data Transfer**
- [Why I can not access files under shared Attic/Swan Globus Collection?](#why-i-can-not-access-files-under-shared-atticswan-globus-collection)
- [I used Globus to copy my data across HCC file systems. How can I check that all data was successfully transferred and the data checksums match?](#i-used-globus-to-copy-my-data-across-hcc-file-systems-how-can-i-check-that-all-data-was-successfully-transferred-and-the-data-checksums-match)
**Account Offboarding**
- [I am graduating soon, what will happen with my HCC account?](#i-am-graduating-soon-what-will-happen-with-my-hcc-account)
- [A member of my HCC group left and I need access to the data under their directories.](#a-member-of-my-hcc-group-left-and-i-need-access-to-the-data-under-their-directories)
**HCC Support and Training**
- [I want to talk to a human about my problem. Can I do that?](#i-want-to-talk-to-a-human-about-my-problem-can-i-do-that)
- [Can HCC provide training for my group?](#can-hcc-provide-training-for-my-group)
- [Can HCC provide help and resources for my workshop?](#can-hcc-provide-help-and-resources-for-my-workshop)
- [Where can I get training on using HCC resources?](#where-can-i-get-training-on-using-hcc-resources)
**Networking**
- [What IP's do I use to allow connections to/from HCC resources?](#what-ips-do-i-use-to-allow-connections-tofrom-hcc-resources)
---
## Account Management
#### I have an account, now what?
Congrats on getting an HCC account! Now you need to connect to a Holland
cluster. To do this, we use an SSH connection. SSH stands for Secure
Shell, and it allows you to securely connect to a remote computer and
operate it just like you would a personal machine.
Depending on your operating system, you may need to install software to
make this connection. Check out our documentation on [Connecting to HCC Clusters](/connecting/).
Additional details on next steps and important links for new account holders are available [here in our documentation](/FAQ/new_account/).
#### How can I change or reset my password?
Information on how to change or retrieve your password can be found on
the documentation page: [How to change your password](/accounts/how_to_change_your_password)
All passwords must be at least 8 characters in length and must contain
at least one capital letter and one numeric digit. Passwords also cannot
contain any dictionary words. If you need help picking a good password,
consider using a (secure!) password generator such as
[this one provided by Random.org](https://www.random.org/passwords)
To preserve the security of your account, we recommend changing the
default password you were given as soon as possible.
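As an illustration, the length, capital-letter, and digit rules above can be checked from a shell. This is a minimal sketch with a hypothetical password; it does not check the dictionary-word rule:

```bash
# Hypothetical password used only for illustration.
pw='Example9pass'
# At least 8 characters, one capital letter, and one numeric digit
# (the dictionary-word rule is not checked here).
if [ "${#pw}" -ge 8 ] \
   && printf '%s' "$pw" | grep -q '[A-Z]' \
   && printf '%s' "$pw" | grep -q '[0-9]'; then
  echo "meets the basic rules"
else
  echo "does not meet the basic rules"
fi
```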
#### How do I (re)activate Duo?
!!! info "If you have not activated Duo before:"
Please join our [Remote Open Office hours](https://hcc.unl.edu/OOH) or schedule another remote
session at [hcc-support@unl.edu](mailto:hcc-support@unl.edu) and show your photo ID and we will be happy to activate it for you.
!!! info "If you have activated Duo previously but now have a different phone number:"
Join our [Remote Open Office hours](https://hcc.unl.edu/OOH) or schedule another remote
session at [hcc-support@unl.edu](mailto:hcc-support@unl.edu) and show your photo ID and we will be happy to activate it for you.
!!! info "If you have activated Duo previously and have the same phone number:"
Email us at
[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
from the email address your account is registered under and we will send
you a new link that you can use to activate Duo.
#### I want to create an HCC account, but when I try to request one, I am getting the error "Your account email must match the email on record for this group". What should I do?
This error message indicates that you have probably selected the *"I am the owner of this group and this account is for me."* checkbox when filling out the New User Request Form.
This checkbox should be selected only by the owner of the HCC group.
If you are not the owner of the HCC group, please do not select this checkbox and try re-submitting the form again.
#### I want to change my primary group or add an additional group to my HCC account
If you would like to change or add groups to your account, such as a class group or additional research group, please fill out our [Group Modification Request Form](https://hcc.unl.edu/group-addchange-request). Once the change is approved by the new group owner, HCC staff will make the adjustments.
#### My account has been locked and I would like to gain access to it.
HCC automatically locks inactive accounts after 1 year for security purposes.
If you would like to reactivate your HCC account in the original group, please email [hcc-support@unl.edu](mailto:hcc-support@unl.edu) and HCC staff will start the process.
If you are wanting to activate it under a new group, please fill out our [Group Modification Request Form](https://hcc.unl.edu/group-addchange-request). Once the change is approved by the new group owner, HCC staff will make the adjustments.
---
## Data Storage
#### I just deleted some files and didn't mean to! Can I get them back?
That depends. Where were the files you deleted?
!!! info "**If the files were in your $HOME directory (/home/group/user/)**"
**It's possible.**
$HOME directories are backed up daily and we can restore your files as
they were at the time of our last backup. Please note that any changes
made to the files between when the backup was made and when you deleted
them will not be preserved. To have these files restored, please contact
HCC Support at
[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
as soon as possible.
!!! info "**If the files were in your $WORK directory (/work/group/user/) or $NRDSTOR directory ({{ hcc.nrdstor.path }})**"
**No.**
Unfortunately, the $WORK directories are created as a short term place
to hold job files. This storage was designed to be quickly and easily
accessed by our worker nodes and as such is not conducive to backups.
Any irreplaceable files should be backed up in a secondary location,
such as Attic, the cloud, or on your personal machine. For more
information on how to prevent file loss, check out [Preventing File
Loss](/handling_data/data_storage/preventing_file_loss/).
#### How can I check which directories utilize the most storage on Swan?
You can run `ncdu` from the Swan terminal on the location in question and (re)move directories and data if needed, e.g.:
```bash
ncdu $HOME/my-folder
```
!!! note
If you have thousands or millions files in a location on Swan, please run `ncdu` only on a sub-directory you suspect may contain large numbers of files.
You may also use `ncdu` on locations in $WORK or $COMMON. Note that running `ncdu` puts additional load on the filesystem(s), so **please run it sparingly**.
HCC suggests running `ncdu` once and saving the output to a file; `ncdu` will read from this file instead of potentially scanning the filesystem multiple times.
To run `ncdu` in this manner, first scan the location using the `-o` option
```bash
ncdu -o ncdu_output.txt $HOME/my-folder
```
Then use the `-f` option to start `ncdu` graphically using this file, i.e.
```bash
ncdu -f ncdu_output.txt
```
Note that re-reading the filesystem to see changes in real time is not supported in this mode. After making changes (deleting/moving files), a new output file
will need to be created and read by repeating the steps above.
#### I want to compress a large directory with many files. How can I do that?
In general, we recommend using `zip` as the archive format as `zip` files keep an index of the files.
Moreover, `zip` files can be quickly indexed by the various `zip` tools, and allow extraction of all files or a subset of files.
To compress the directory named `input_folder` into `output.zip`, you can use:
```bash
zip -r output.zip input_folder/
```
If you don't need to list and extract subsets of the archived data, we recommend using `tar` instead.
To compress the directory named `input_folder` into `output.tar.gz`, you can use:
```bash
tar zcf output.tar.gz input_folder/
```
Depending on the size of the directory and number of files you want to compress, you can perform the compressing via an [Interactive Job](/submitting_jobs/creating_an_interactive_job/) or [SLURM job](/submitting_jobs/).
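As a sketch, wrapping the compression in a batch job might look like the following submit script (the resource values below are placeholders to adjust for your data):

```bash
#!/bin/bash
#SBATCH --job-name=compress
#SBATCH --ntasks=1
#SBATCH --time=04:00:00   # placeholder: increase for very large directories
#SBATCH --mem=4G          # placeholder: adjust as needed

# zip keeps an index, allowing listing and partial extraction later
zip -r output.zip input_folder/
```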
#### I want to share data with others on Swan. How can I do that?
There are multiple methods of sharing data on Swan, including file permissions and Globus.
NRDStor by default has a shared directory for every group at `{{hcc.nrdstor.path}}/group_name/shared`. The $WORK filesystem can also have a shared folder, but it needs to be requested by emailing [hcc-support@unl.edu](mailto:hcc-support@unl.edu), and HCC staff will start the process.
More details on data sharing are available in the [Sharing Data on Swan documentation](/handling_data/swan_data_sharing/).
---
## Job Submission
#### How many nodes/memory/time should I request?
**Short answer:** We don’t know.
**Long answer:** The amount of resources required is highly dependent on
the application you are using, the input file sizes and the parameters
you select. Sometimes it can help to speak with someone else who has
used the software before to see if they can give you an idea of what has
worked for them.
Ultimately, it comes down to trial and error; try different
combinations and see what works and what doesn’t. Good practice is to
check the output and utilization of each job you run. This will help you
determine what parameters you will need in the future.
For more information on how to determine how many resources a completed
job used, check out the documentation on [Monitoring Jobs](/submitting_jobs/monitoring_jobs/).
#### I am trying to run a job but nothing happens. What should I do?
Where are you trying to run the job from? You can check this by typing
the command `pwd` into the terminal.
**If you are running from inside your $HOME directory
(/home/group/user/)**:
Move your files to your $WORK directory (/work/group/user) and resubmit
your job. The $HOME folder is not meant for job output. You may be attempting
to write too much data from the job.
**If you are running from inside your $WORK directory:**
Contact us at
[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
with your login, the name of the cluster you are running on, and the
full path to your submit script and we will be happy to help solve the
issue.
#### I keep getting the error "slurmstepd: error: Exceeded step memory limit at some point." What does this mean and how do I fix it?
This error occurs when the job you are running uses more memory than was
requested in your submit script.
If you specified `--mem` or `--mem-per-cpu` in your submit script, try
increasing this value and resubmitting your job.
If you did not specify `--mem` or `--mem-per-cpu` in your submit script,
chances are the default amount allotted is not sufficient. Add the line
```bash
#SBATCH --mem=<memory_amount>
```
to your script with a reasonable amount of memory and try running it again. If you keep
getting this error, continue to increase the requested memory amount and
resubmit the job until it finishes successfully.
For additional details on how to monitor usage on jobs, check out the
documentation on [Monitoring Jobs](/submitting_jobs/monitoring_jobs/).
If you continue to run into issues, please contact us at
[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
for additional assistance.
#### I keep getting the error "Some of your processes may have been killed by the cgroup out-of-memory handler." What does this mean and how do I fix it?
This is another error that occurs when the job you are running uses more memory than was
requested in your submit script.
If you specified `--mem` or `--mem-per-cpu` in your submit script, try
increasing this value and resubmitting your job.
If you did not specify `--mem` or `--mem-per-cpu` in your submit script,
chances are the default amount allotted is not sufficient. Add the line
```bash
#SBATCH --mem=<memory_amount>
```
to your script with a reasonable amount of memory and try running it again. If you keep
getting this error, continue to increase the requested memory amount and
resubmit the job until it finishes successfully.
For additional details on how to monitor usage on jobs, check out the
documentation on [Monitoring Jobs](/submitting_jobs/monitoring_jobs/).
If you continue to run into issues, please contact us at
[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
for additional assistance.
#### I keep getting the error "Job cancelled due to time limit." What does this mean and how do I fix it?
This error occurs when the job you are running reached the time limit that was
requested in your submit script without finishing successfully.
If you specified `--time` in your submit script, try
increasing this value and resubmitting your job.
If you did not specify `--time` in your submit script,
chances are the default runtime of 1 hour is not sufficient. Add the line
```bash
#SBATCH --time=<runtime>
```
to your script with increased runtime value and try running it again. The maximum runtime on Swan
is 7 days (168 hours).
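For reference, Slurm's `--time` option accepts several formats; a few alternative examples are sketched below (use only one `--time` line per script):

```bash
#SBATCH --time=30:00       # 30 minutes (minutes:seconds)
#SBATCH --time=36:00:00    # 36 hours (hours:minutes:seconds)
#SBATCH --time=2-12:00:00  # 2 days and 12 hours (days-hours:minutes:seconds)
#SBATCH --time=7-00:00:00  # 7 days, the maximum runtime on Swan
```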
For additional details on how to monitor usage on jobs, check out the
documentation on [Monitoring Jobs](/submitting_jobs/monitoring_jobs/).
If you continue to run into issues, please contact us at
[hcc-support@unl.edu](mailto:hcc-support@unl.edu)
for additional assistance.
#### My submitted job spends a long time waiting in the queue or is not running. Why?
If your submitted jobs spend a long time waiting in the queue, it usually means your account has been over-utilizing resources and your fairshare score is low. This can be caused by submitting a large number of jobs over a recent period of time, and/or by requesting a large amount of resources (memory, time) for your jobs.
For additional details on how to monitor usage on jobs, check out the documentation on [Monitoring queued Jobs](/submitting_jobs/monitoring_jobs/).
#### Why is my job showing (ReqNodeNotAvail, Reserved for maintenance) before a downtime?
Jobs submitted before a downtime may pend and show _(ReqNodeNotAvail, Reserved for maintenance)_ for their status.
(Information on upcoming downtimes can be found at [status.hcc.unl.edu](https://status.hcc.unl.edu/).)
Any job which cannot finish before a downtime is scheduled to begin will pend and show this message. For example,
the downtime starts in 6 days but the script is requesting (via the `--time` option) 7 days of runtime.
If you are sure your job can finish in time, you can lower the requested time to be less than the interval before
the downtime begins (for example, 4 days if the downtime starts in 6 days). Use this with care however to ensure your
job isn't prematurely terminated. Alternatively, you can simply wait until the downtime is completed. Jobs will
automatically resume normally afterwards; no special action is required.
#### My job is submitted to the highmem partition and is pending with QOSMinMemory reason. What does this mean?
The majority of nodes in the `batch` partition on Swan have 256GBs of RAM, with a few nodes with up to 2TBs of RAM. To ensure that the jobs that require lots of memory will run on the nodes with more RAM memory, SLURM uses the `highmem` partition, which is part of the `batch` partition. **This is not an actual partition, so it can not be separately used.** SLURM internally submits the job to both `highmem` and `batch` partitions, and depending on the requested RAM memory, allocates the requested resources. During this process, when checking the job status, you may see:
```bash
$ squeue -u demo
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1000000 highmem,b job_name demo PD 0:00 1 (QOSMinMemory)
```
This message means that the job does not require high memory and it will be submitted to the `batch` partition when the requested resources are available. Once this internal process is completed, the `NODELIST(REASON)` message will be updated accordingly.
Please note that `highmem,b` is truncated from `highmem,batch`. The expanded output can be seen with:
```bash
$ squeue -u demo -o "%.18i %.20P %.8j %.8u %.2t %.10M %.6D %R"
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1000000 highmem,batch job_name demo PD 0:00 1 (QOSMinMemory)
```
!!! note
The number of nodes with high memory is limited, so please only request high amounts of memory if the job really needs it. Otherwise, you may encounter longer waiting times, lower submission priority and underutilized resources.
---
## Open OnDemand
#### Why is my Open OnDemand JupyterLab or Interactive App Session stuck and not starting?
The most common reason for this is a full `$HOME` directory. You can check the size of the directories in `$HOME` by running `ncdu` in the terminal from the `$HOME` directory.
Then, please remove any unnecessary data; move data to [$COMMON or $WORK](/handling_data/); or [back up important data elsewhere](/handling_data/data_storage/preventing_file_loss/).
If the majority of storage space in `$HOME` is utilized by `conda` environments, please [move the conda environments](/applications/user_software/using_anaconda_package_manager/#moving-and-recreating-existing-environment) or [remove unused conda packages and caches](/applications/user_software/using_anaconda_package_manager/#remove-unused-anaconda-packages-and-caches).
#### My directories are not full, but my Open OnDemand JupyterLab Session is still not starting. What else should I try?
If the inode and storage quotas of the `$HOME` and `$WORK` directories on Swan are not exceeded, and your Open OnDemand JupyterLab Session is still not starting, there are two additional things you can check:
- If a custom conda environment is not used, when installing packages with `pip` locally, these libraries are installed in `$HOME/.local`. When using other modules or applications that are based on Python/conda (e.g., Open OnDemand JupyterLab), these local installs can cause conflicts and errors. In this case, please rename the `$HOME/.local` directory (e.g., `mv $HOME/.local $HOME/.local_old`).
- Please make sure that you don't set variables such as `PYTHONPATH` in your `$HOME/.bashrc` file. If you have this variable set in the `$HOME/.bashrc` file, please comment out that line and run `source $HOME/.bashrc` to apply the changes.
Whether you have renamed the `$HOME/.local` directory and/or modified the file `$HOME/.bashrc`, please cancel and restart your JupyterLab Session.
#### Why is my Open OnDemand RStudio Session crashing?
There are two main reasons why this may be happening:
1) The requested RAM is not enough for the analyses you are performing. In this case, please terminate your running Session and start a new one requesting more RAM.
2) Some R packages installed as part of the OOD RStudio App may be incompatible with each other. In this case, please terminate your running Session and rename the directory where these packages are installed (e.g., `mv $HOME/R $HOME/R.bak`). To reduce the number of R packages you need to install, please use a specific variant such as Bioconductor, Tidyverse, or Geospatial when needed, instead of, for example, installing Bioconductor packages in the OOD RStudio Basic variant.
#### I need more resources than I can select with Open OnDemand Apps. Can I do that?
The Open OnDemand Apps are meant to be used for learning, development, and light testing, and have limited resources compared to those available for batch submissions. If the resources provided by the OOD Apps are not enough, please migrate your workflow to a batch script.
---
## Data Transfer
#### Why can I not access files under a shared Attic/Swan Globus Collection?
On some occasions, errors such as _"Mapping collection to specified ID failed."_ may occur when accessing files from a shared Attic/Swan Globus Collection.
In order to resolve this issue, the owner of the collection needs to login to Globus and activate the `hcc#attic` or `hcc#swan` endpoint respectively.
This should reactivate the correct permissions for the collection.
#### I used Globus to copy my data across HCC file systems. How can I check that all data was successfully transferred and the data checksums match?
Globus automatic file integrity verification using checksums is turned off for HCC-specific Globus collections/endpoints.
Automatic file integrity verification via Globus re-reads and compares all source and destination files, which adds significant I/O load on the HCC file systems.
All HCC-specific file systems already have built-in integrity checksums, so additional verification is not needed.
If the status of your Globus transfer is `SUCCEEDED`, then the checksums of the source and destination files should match.
If you would still like to compare the checksums of the transferred files regardless of the Globus transfer status, there are two ways you can achieve this:
- Start another Globus transfer and select _both_ **"sync - only transfer new or changed files"** and **"checksum is different"** under _"Transfer & Timer Options"_ under the _File Manager_ Globus tab.
- Run `rsync` with the `--checksum` option using an [Interactive Session](https://hcc.unl.edu/docs/submitting_jobs/creating_an_interactive_job/) or [SLURM job](https://hcc.unl.edu/docs/submitting_jobs/).
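To illustrate the `rsync` approach, here is a self-contained sketch using throwaway directories; in practice the source and destination would be the two ends of your Globus transfer:

```bash
# Build two directories: a.txt identical on both sides, b.txt different.
src=$(mktemp -d); dst=$(mktemp -d)
echo "same content"   > "$src/a.txt"; cp "$src/a.txt" "$dst/a.txt"
echo "source version" > "$src/b.txt"; echo "dest version" > "$dst/b.txt"
# -r recurse, -n dry run (changes nothing), -i itemize differences;
# --checksum compares file contents instead of size/timestamp.
# Only b.txt should be reported as differing.
rsync -rni --checksum "$src/" "$dst/"
```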
---
## Account Offboarding
#### I am graduating soon, what will happen with my HCC account?
Access to HCC resources is separate from access to NU resources, so you do not lose access to HCC when you graduate.
- If the HCC account is part of a research group, the account will remain active until the owner of the group requests that the account be deactivated, or until the account has not been used for a minimum of a year, whichever comes first.
- If the account holder continues collaborating with the HCC group owner as an outside collaborator, a proof of collaboration may be required. For more information on the User regulations please see [here](https://hcc.unl.edu/hcc-policies#user-regulations).
- If the account is only part of a course group, then according to [our class policy](https://hcc.unl.edu/hcc-policies#class-groups), the account will be deactivated one week after the course end date.
#### A member of my HCC group left and I need access to the data under their directories.
User directories under `{{ hcc.swan.home.path }}`, `{{ hcc.swan.work.path }}`, and `{{ hcc.nrdstor.path }}` by default are only accessible by the individual user account for those directories.
If you need access to a user directory under HCC's filesystems, please email [hcc-support@unl.edu](mailto:hcc-support@unl.edu) and HCC staff will start the process.
---
## HCC Support and Training
#### I want to talk to a human about my problem. Can I do that?
Of course! We have an open door policy and invite you to join our [Remote Open Office hours](https://hcc.unl.edu/OOH), schedule a remote
session at [hcc-support@unl.edu](mailto:hcc-support@unl.edu), or you can drop one of us a line and we'll arrange a time to meet: [Contact Us](https://hcc.unl.edu/contact-us).
#### Can HCC provide training for my group?
HCC can provide introductory training (up to 2 hours) for groups (more than 2 people) on request, via Zoom or in person.
Before submitting a request for training, please ensure everyone who will be attending has an active HCC account **and** has activated DUO for their HCC account.
Training requests can be submitted to [hcc-support@unl.edu](mailto:hcc-support@unl.edu).
Please include:
- A list of available times
- How many will be attending
- Preference on Zoom or on-site training
- If on-site, the location where the training will be held
An HCC staff member will reach out to confirm the date and location.
HCC also provides virtual Open Office Hours every Tuesday and Thursday from 2-3 PM. More details are available on the [Office Hours webpage](https://hcc.unl.edu/OOH).
#### Can HCC provide help and resources for my workshop?
We are happy to help with your workshop! We are able to provide up to 40 demo accounts for participants who don't already have HCC accounts, and can have a staff member on-site to assist with HCC-related issues and answer HCC-related questions.
Please submit your request at least **1 month in advance**. Requests may not be fulfilled depending on staff availability. It is strongly recommended to involve HCC staff during the initial planning of the hands-on portion of the workshop in order to provide a smooth and timely experience with HCC resources.
Before submitting a request for workshop support, please fully test your materials using Swan and provide us a complete list of any software packages and environments that are needed.
Workshop support requests can be submitted to [hcc-support@unl.edu](mailto:hcc-support@unl.edu).
Please include:
- The date(s) and time(s) HCC will be utilized during the workshop, including any necessary setup.
- How many will be attending. If possible, please provide how many already have HCC accounts.
- Location of the workshop
- A list of software packages or environments that everyone would need.
- If you are using Open OnDemand JupyterLab or RStudio, we can create a custom kernel/image for the purpose of the workshop.
- If you are using a conda environment, we can create a shared environment for participants to use without the need for participants to create their own.
An HCC staff member will reach out to confirm the date and location.
#### Where can I get training on using HCC resources?
HCC provides free and low cost training events throughout the year. Most events are held in-person, but some will be hybrid or Zoom.
New events are posted on our [upcoming events page](https://hcc.unl.edu/upcoming-events) and announced through our [hcc-announce mailing list](https://hcc.unl.edu/subscribe-mailing-list).
Past events and their materials are also available on our [past events page](https://hcc.unl.edu/past-events).
---
## Networking
#### What IP's do I use to allow connections to/from HCC resources?
Under normal circumstances no special network permissions are needed to access HCC resources. Occasionally, it may be necessary to whitelist the public IP
addresses HCC utilizes. Most often this is needed to allow incoming connections for an external-to-HCC license server, but may also be required
if your local network blocks outgoing connections. To allow HCC IPs, add the following ranges to the whitelist:
```
129.93.175.0/26
129.93.227.64/26
129.93.241.16/28
```
If you are unsure how to do this, contact your local IT support staff for assistance.
For additional questions or issues with this, please [Contact Us](https://hcc.unl.edu/contact-us).
---
title: I have an HCC account, now what?
summary: "Information after getting a new HCC account"
---
- [Important links and information](#important-links-and-information)
- [HCC Support](#hcc-support)
- [HCC Resource Status](#hcc-resource-status)
- [Data Storage](#data-storage)
- [Getting started running jobs on HCC resources.](#getting-started-running-jobs-on-hcc-resources)
- [Account 2FA and Password](#account-2fa-and-password)
- [I forgot my password, how can I retrieve it?](#i-forgot-my-password-how-can-i-retrieve-it)
- [If I get a new phone, how do I (re)activate Duo?](#if-i-get-a-new-phone-how-do-i-reactivate-duo)
---
#### I have requested an account, what are my next steps?
Once you request an account, HCC staff will wait for a confirmation from either your instructor or lab group advisor.
When HCC receives this confirmation, your account will be created and a temporary password will be sent to your email.
However, you will not be able to sign in just yet; first you will need to set up our multifactor authentication.
- If you are a part of the NU System (UNL, UNO, UNK, UNMC), HCC will ask if you would like to use the phone number tied to your TrueYou account. Once you confirm this number, HCC staff will have DUO send a text message to activate multifactor authentication for HCC resources.
- If you are not a part of the NU system or wish to use a different phone number than what is in TrueYou, you will need to attend one of our remote office hours where an HCC staff member will assist you.
- If you do not wish to use your smartphone to authenticate with HCC resources, you can use a YubiKey with HCC resources. More information on YubiKeys is available [here](/accounts/setting_up_and_using_duo/#yubikeys)
It is strongly recommended to [change your password](/accounts/how_to_change_your_password/) as soon as possible.
At this point, you can now connect to and begin using HCC resources.
#### How do I connect to HCC resources?
Congrats on getting an HCC account! Now you need to connect to an HCC resource.
To do this, you can use an SSH connection or a web browser to access the online web portal, Open OnDemand.
Both methods allow you to securely connect to a remote HCC resource and
operate it just like you would a personal machine.
Check out our documentation on [Connecting to HCC Clusters using SSH](/connecting/) and on [using Open OnDemand](/open_ondemand/).
## Important links and information
#### HCC Support
- **Real time and email support**
We have an open door policy and invite you to join our [Remote Open Office hours](https://hcc.unl.edu/OOH), schedule a remote or in-person session at [hcc-support@unl.edu](mailto:hcc-support@unl.edu), or you can drop one of us a line and we'll arrange a time to meet: [Contact Us](https://hcc.unl.edu/contact-us).
- **Class and group tutorial and introduction sessions**
If your class or group is interested in an introductory session, please [email us](mailto:hcc-support@unl.edu) and we will work to schedule a time.
- **Training events**
We also host training events throughout the year, which are announced via email to everyone with an HCC account and viewable on our [Upcoming Events](https://hcc.unl.edu/upcoming-events) page. Materials from prior events are available on our [Past Events](https://hcc.unl.edu/past-events) page.
- **HCC Courses**
In addition to live events, we also have courses available through Bridge on our [HCC Courses](https://hcc.unl.edu/hcc-courses) page.
#### HCC Resource Status
From time to time, HCC resources may be unavailable due to scheduled maintenance or unforeseen issues. Upcoming maintenance will be announced ahead of time on the [status page](https://status.hcc.unl.edu/) and via email. Any unforeseen issues will be posted to the [status page](https://status.hcc.unl.edu/) once HCC staff have identified the issue.
#### Data Storage
- **Where to store data:** HCC has 4 filesystems available for use. Specific details on the file systems are available on our [data storage documentation.](/handling_data/data_storage/)
- **Key information:** The main thing to note is that the Work and NRDStor filesystems do not have any backups. Any files deleted on these filesystems are **permanently lost**.
- **Purge policy:** The Work filesystem also has a purge policy in place to delete old and unused files to keep the file system performant. There is **no** purge policy on Attic, Home, Common, or NRDStor.
Attic and Home are the only file systems with backups available.
- **Preventing file loss:** It is _**critical**_ to backup any important data related to your class, creative, or research activities. HCC provides a [guide](/handling_data/data_storage/preventing_file_loss/) on how to backup important data and prevent file loss.
## Getting started running jobs on HCC resources.
Once you have your account setup and reviewed the important information above, you can begin conducting class, creative, or research activities on HCC resources.
For Swan, you will need to take a few steps.
1. [Transfer data to HCC Clusters](/handling_data/)
2. [Check software availability](/applications/)
3. [Submit jobs on HCC Clusters](/submitting_jobs/) or use [Open OnDemand](/open_ondemand/)
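To give a flavor of step 3, a first submit script might look like the sketch below (the module name and script are hypothetical placeholders; see the submitting jobs documentation for details):

```bash
#!/bin/bash
#SBATCH --job-name=first-job
#SBATCH --ntasks=1
#SBATCH --time=01:00:00            # placeholder runtime
#SBATCH --mem=2G                   # placeholder memory
#SBATCH --output=first-job.%J.out  # %J expands to the job ID
#SBATCH --error=first-job.%J.err

module load python   # hypothetical module; check availability first
python my_script.py  # hypothetical analysis script
```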
## Account 2FA and Password
#### I forgot my password, how can I retrieve it?
Information on how to change or retrieve your password can be found on
the documentation page: [How to change your
password](/accounts/how_to_change_your_password)
All passwords must be at least 8 characters in length and must contain
at least one capital letter and one numeric digit. Passwords also cannot
contain any dictionary words. If you need help picking a good password,
consider using a (secure!) password generator such as
[this one provided by Random.org](https://www.random.org/passwords)
To preserve the security of your account, we recommend changing the
default password you were given as soon as possible.
#### If I get a new phone, how do I (re)activate Duo?
**If you have not activated Duo before:**
If you are a part of the NU System, email us at [hcc-support@unl.edu](mailto:hcc-support@unl.edu)
from the email address your account is registered under, include the last 4 digits of your phone number, and we will send you a link that you can use to activate Duo.
Otherwise, please join our [Remote Open Office hours](https://hcc.unl.edu/OOH) or schedule another remote
session at [hcc-support@unl.edu](mailto:hcc-support@unl.edu) and show your photo ID, and we will be happy to activate it for you.
**If you have activated Duo previously but now have a different phone number:**
Join our [Remote Open Office hours](https://hcc.unl.edu/OOH) or schedule another remote
session at [hcc-support@unl.edu](mailto:hcc-support@unl.edu) and show your photo ID and we will be happy to activate it for you.
**If you have activated Duo previously and have the same phone number:**
Email us at [hcc-support@unl.edu](mailto:hcc-support@unl.edu)
from the email address your account is registered under and we will send
you a new link that you can use to activate Duo.
---
title: SSH host keys
summary: "SSH keys for HCC services"
---
- [Summary](#summary)
- [Swan](#swan)
- [Crane](#crane)
---
## Summary
SSH host keys securely identify SSH servers. When an SSH client connects to a server, the server presents a host key to the client. The client confirms the key has not changed from previous connections.
HCC tries to keep the SSH keys unchanged throughout the lifetime of a service, but security requirements evolve, and key changes may be necessary. If you get a key change warning, please confirm the fingerprint against the list below before updating your SSH client known hosts list. If the key fingerprint does not match, do not accept the new key and *do not enter* your credentials, and please contact HCC.
### Example warning message
```
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
```
[//]: # ( for i in /etc/ssh/ssh_host_*pub ; do ssh-keygen -lf $i ; done )
## Swan
### swan.unl.edu fingerprints
```
SHA256:qcyi6CEw1gUgumEghA+TcXFmu39MAO4Pyrt8rT6+ymk (ECDSA)
SHA256:SrpwIZFSaZ3Nt6Ne9PW/7SSHXo1sdT0QnputriPAmA0 (ED25519)
SHA256:GfkTzeP/gWn0NChgWwAqOpuSVWPNtXbjlVqy2pyRGlk (RSA)
```
### swan-xfer.unl.edu fingerprints
```
SHA256:TvONmFeLVTA3IyA1IkGqzcLnLSOYZ2lkOWUesQ33nE0 (ECDSA)
SHA256:hY4dkI8ngY//lwuOC3sUgZtRNnMl6zkWPX10ptlSgiY (ED25519)
SHA256:l8pMknftfqRVtFF+BQc2WXwbZ23QhnjbG2erzPLzrGc (RSA)
```
### known\_hosts
```
swan.unl.edu ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBGJgEGqgkU3g8tzgedOXnNmGvBmgU8wPHnoFW1MDREhfdsDwyOvq+Pu+O+vSf1B4f3Krl49VkDhk1/kzMOSa/2U=
swan.unl.edu ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFH3i+E4EKT20y+tXmnizsXN2c6Lg2SlaGjsbERegll6
swan.unl.edu ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDFjkdZ8i/zRJ1XVopK4n4YNVgNG7eKMvrUXh2gHFXPkWKoIfUAnTywZIRx/iWRTmgWMn9QPIiD9cyULFbeqOCgXg0NoBlrUxThoKVF7TlXd4XXW7cK8n3a8/H/c8PrRON+5bNE6La+b1zwnTA5KbPYkSJ4A0F622c+LH2EEta0RYzEuskfNY7l8gBWpvtX11Xj2nOstXW543xHFZETqW4C2Cz7dYxTxU1kduZPSUFYb9SAXOfr9cnUn9z1+HzfWv9s3E3pn8irwP7YVDWxEXJi4atiHnfcKhWI/sY1TDhbFlJL0mMKqeBpKlP0LR4B4ck8XaXpPqd4QpDbFtqY8Vmol8pxtjkFydPOozbshvs7MWMwfQsXqMfZRlVOJ9ecitognxn3nIffmpZpXFB+lfyYrZvCXr74MVF2QXeZzX0Js0Y7q0Xst+lacRnt/cF4grH6Z0k8LTV+CiuzUvxcjwePLHBFYmtD2bN/H418Y7IOEHsGQd5xaYMrxidwOB0DQjE=
swan-xfer.unl.edu ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBCWWKy921NTqcvbDISuOmNnViBegfwVk3Ni25Z0e8GBmTPXPDCyoVGCwAyXtiqxVFjPwNt9/zD8VWi/hInE9O8o=
swan-xfer.unl.edu ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDKgahWZaLmYmGfIOigoAGaW2Wpj84UzrpmgnXvkxMtN
swan-xfer.unl.edu ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCxyaMW7B/ifN8AyNIfX89fpzg8zE9SYTxDwqRKkW9giUQKER7GNZsF+WHlEkdkHcmv51jvFHNTwk4ZB0g/lZNAGxqJFbUIHf3K0gYCdji5MVQXqvMT40m8Flup8gxJWG07D3rqj2qeDlGxkMJQ1gUUsN90Exh---2rVkaOHMwjST2Q2HqklGdgDt9m3cgfAqhofLGNnQBUJCSy1fcpl58kJflMQXvnWvmAsUbzxoelqbifQGUgl9pdUDm+MJRAxnTI+KlNrIvDqokHXvb5Sy7UqLxVE8sg2CWmrfCReEnKzAM4cQ22bekRh9MrZqU9gLgt5Ez0DT1caFCk5xo+7MTYOfo3wd6KG+5flY+7MsgTVPHlFO8G/vZVKk1hTuPkOyByrzlbQ/IEzMQYXun6v/CKgiMvh2/rkAKX9oRnXVQjCq9o4EzKE6bkP4dYDNfdKePXANrS2J7mOpgQgKTzH4rO3wxnDG8BfOBctnyCeZo8GuRvRJoiBYD8f3SdeRSabXWU=
```
## Crane
### crane.unl.edu fingerprints
```
SHA256:0C+JzhK+Lw0Sm3HBcqn84vLIVr79ewp548Q4EnMikRQ (ECDSA)
SHA256:s/ZLlfcinYzEkamJ15Htek+0tpKlwo5uq+HajcxowRE (ED25519)
SHA256:GDH3+iqSp3WJxtUE6tXNQcWRwpf0xjYgkQrYBDX3Ir0 (RSA)
```
### crane-xfer.unl.edu fingerprints
```
SHA256:k9w+Khxea9ZZCuayJc+YIFi+9LRV68QPHXl7OSEANis (ECDSA)
SHA256:V64bGApa8S1MbvfkFQGEUL9xx7RWOzWetorPpGyBTMw (ED25519)
SHA256:WfS6HcnjbOrD8z11C1RYgF4TNZsqUnM/ZBpoTO2J4Pg (RSA)
```
### known\_hosts
```
crane.unl.edu ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBKFSyNCtoEMiMCLyvh6FsoFhfmGtJEbmeDqfoqi3tU2WaX7Dtc6Ti0k87MxfgPmLSn+wqSXDUxNL2eMdzuy9LRg=
crane.unl.edu ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILqREOJ9NtVi9d7p0pBBOumIbUJxc1Io3847pJacp8Ct
crane.unl.edu ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAv0LS4DwF7vdSk0jz5/otXyGn93hieUXu9ZFQgAVZLHyOYveBYSTyW9GCicAGtBBePbt3PvGzYARLHa9CX3K/sqAeax1DpmHuM1x6SNvIrvYySymqRJZWcvRHOo8o2Zc/5NZMmh06AtCLKuG9SxdmjuoBcfgX6AtG6gfH9t/7k2Q04qwpvaRy8cRbJCndiW8UAJblqg722m52reydqg8iN5C1QD/773yEUPdfVNVqIMsnSvoiEbSijxocaopiKiU9f1/QdvOIDyh6/b5BXp1o2zQ1fd4I0OveZDKn3QAqpX44wfuJ1u9D/Gq87K5NhyE3TqPUIe/+ZqZNzrcZmii66w==
crane-xfer.unl.edu ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBAWBwHOTjsPaftbFP1hL/rEo9E29yrEy+l82TfN6NJCvaFb4kdppDo86utmadWMB08YbUghEdOJXoNpudb7AHlI=
crane-xfer.unl.edu ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBL/nZV+XQEV19FODr1F2Z6Kn0jDhM+xEaq9QRDWetEY
crane-xfer.unl.edu ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDBggvAv6Gj6q5RnmMegh3S9YqNN2PEIyjEodbcvKyncY7r0seKVP4ABthrS3fU8bz1wSVi4nSVhMhGT4eRaDViy9NJPoebyORmiAczWgIZKip79YM+UZSQgZI0OtgfDv3Ouv2/ti1ymn0qVEHLrm7pLGto7HJlLGRpj4Acwid5T6sY3QTGyaJ64hkblYWfAryVcNIS/Qfz7wyeZEG1WE4XKPy2Ax+dRiYBiSWoORWzdMAs5OASrgtM1CeScjqrBlx3BDBi9FLzqbBq+IO/5Ee31Tvg9zBZtTgwAGwxsWhCvA+V95FDvSSjrjjNFEZ1ThuWbyDlEOmfF/B/k6wdxSTP
```
---
title: Basic Kubernetes
summary: "Basic Kubernetes"
weight: 20
---
### Setup
This section assumes you've completed the [Quick Start](quick_start.md) section.
If you are in multiple namespaces, you need to be aware of which namespace you’re working in, and either set it with `kubectl config set-context nautilus --namespace=the_namespace` or specify it in each `kubectl` command by adding `-n namespace`.
### Explore the system
To get the list of cluster nodes (although you may not have access to all of them), type:
```
kubectl get nodes
```
Right now you probably don't have anything running in the namespace, and these commands will return `No resources found in ... namespace.`. There are three categories we will examine: pods, deployments and services. Later these commands will be useful to see what's running:
List all the pods in your namespace
```
kubectl get pods
```
List all the deployments in your namespace
```
kubectl get deployments
```
List all the services in your namespace
```
kubectl get services
```
### Launch a simple pod
Let’s create a simple generic pod, and login into it.
You can copy-and-paste the lines below. Create the `pod1.yaml` file with the following content:
```yaml
apiVersion: v1
kind: Pod
metadata:
name: test-pod
spec:
containers:
- name: mypod
image: ubuntu
resources:
limits:
memory: 100Mi
cpu: 100m
requests:
memory: 100Mi
cpu: 100m
command: ["sh", "-c", "echo 'Im a new pod' && sleep infinity"]
```
Reminder, indentation is important in YAML, just like in Python.
*If you don't want to create the file and are using Mac or Linux, you can create yaml's dynamically like this:*
```
kubectl create -f - << EOF
<contents you want to deploy>
EOF
```
Now let’s start the pod:
```
kubectl create -f pod1.yaml
```
See if you can find it:
```
kubectl get pods
```
Note: You may see the other pods too.
If it is not yet in Running state, you can check what is going on with
```
kubectl get events --sort-by=.metadata.creationTimestamp
```
Events and other useful information about the pod can be seen in `describe`:
```
kubectl describe pod test-pod
```
If the pod is in Running state, we can check its logs
```
kubectl logs test-pod
```
Let’s log into it
```
kubectl exec -it test-pod -- /bin/bash
```
You are now inside the (container in the) pod!
Does it feel any different than a regular, dedicated node?
Try to create some directories and some files with content.
(Hello world will do, but feel free to be creative)
We will want to check the status of the networking,
but `ifconfig` is not available in the image we are using, so let’s install it.
First, let's make sure our installation tools are updated.
```
apt update
```
Now, we can use apt to install the necessary network tools.
```
apt install net-tools
```
Now check the networking:
```
ifconfig -a
```
Get out of the Pod (with either Control-D or exit).
You should see the same IP displayed with kubectl
```
kubectl get pod -o wide test-pod
```
We can now destroy the pod
```
kubectl delete -f pod1.yaml
```
Check that it is actually gone:
```
kubectl get pods
```
Now, let’s create it again:
```
kubectl create -f pod1.yaml
```
Does it have the same IP?
```
kubectl get pod -o wide test-pod
```
Log back into the pod:
```
kubectl exec -it test-pod -- /bin/bash
```
What does the network look like now?
What is the status of the files you created?
Finally, let’s delete the pod explicitly:
```
kubectl delete pod test-pod
```
### Let’s make it a deployment
You saw that when a pod was terminated, it was gone.
While here we deleted it ourselves, the result would have been the same if a node had died or been restarted.
To gain higher availability, the use of Deployments is recommended, so that’s what we will do next.
You can copy-and-paste the lines below.
###### dep1.yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: test-dep
labels:
k8s-app: test-dep
spec:
replicas: 1
selector:
matchLabels:
k8s-app: test-dep
template:
metadata:
labels:
k8s-app: test-dep
spec:
containers:
- name: mypod
image: ubuntu
resources:
limits:
memory: 500Mi
cpu: 500m
requests:
memory: 100Mi
cpu: 50m
command: ["sh", "-c", "sleep infinity"]
```
Now let’s start the deployment:
```
kubectl create -f dep1.yaml
```
See if you can find it:
```
kubectl get deployments
```
The Deployment itself is just a controller object, though; the actual work happens in a pod it manages.
See if you can find the associated pod:
```
kubectl get pods
```
Once you have found its name, let’s log into it
```
kubectl get pod -o wide test-dep-<hash>
kubectl exec -it test-dep-<hash> -- /bin/bash
```
You are now inside the (container in the) pod!
Create directories and files as before.
Try various commands as before.
Let’s now delete the pod!
```
kubectl delete pod test-dep-<hash>
```
Is it really gone?
```
kubectl get pods
```
What happened to the deployment?
```
kubectl get deployments
```
Get into the new pod
```
kubectl get pod -o wide test-dep-<hash>
kubectl exec -it test-dep-<hash> -- /bin/bash
```
Was anything preserved?
Let’s now delete the deployment:
```
kubectl delete -f dep1.yaml
```
Verify everything is gone:
```
kubectl get deployments
kubectl get pods
```
### More tutorials are available at [Nautilus Documentation - Tutorials](https://docs.pacificresearchplatform.org)
---
title: Batch Jobs
summary: "Batch Jobs"
weight: 40
---
### Running batch jobs
#### Basic example
Kubernetes has support for running batch jobs. A Job is a controller which watches your pod and makes sure it exited with exit status 0. If it did not, for any reason, the pod will be restarted up to `backoffLimit` times.
Since jobs in Nautilus are not limited in runtime, you may only run jobs with a meaningful `command` field. Running in manual mode (a `sleep infinity` `command` with a manual start of the computation) is prohibited.
Let's run a simple job and get its result.
Create a job.yaml file and submit:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
name: pi
spec:
template:
spec:
containers:
- name: pi
image: perl
command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
resources:
limits:
memory: 200Mi
cpu: 1
requests:
memory: 50Mi
cpu: 50m
restartPolicy: Never
backoffLimit: 4
```
Explore what's running:
```
kubectl get jobs
kubectl get pods
```
When the job is finished, your pod will stay in the Completed state, and the Job will show a COMPLETIONS field of 1/1. For long jobs, the pods can pass through Error, Evicted, and other states until they finish properly or the backoffLimit is exhausted.
This example job did not use any storage and wrote its result to STDOUT, which can be seen in the pod logs:
```
kubectl logs pi-<hash>
```
The pod and job will remain for you to come and look at for `ttlSecondsAfterFinished=604800` seconds (1 week) by default, and you can adjust this value in your job definition if desired.
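For example, a sketch of shortening the cleanup window to one hour; the `ttlSecondsAfterFinished` field goes at the top level of the Job `spec`:

```yaml
spec:
  ttlSecondsAfterFinished: 3600
```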
**Please make sure you did not leave any pods and jobs behind.** To delete the job, run
```
kubectl delete job pi
```
#### Running several bash commands
You can group several commands, and use pipes, like this:
```
command:
- sh
- -c
- "cd /home/user/my_folder && apt-get install -y wget && wget pull some_file && do something else"
```
#### Logs
All stdout and stderr outputs from the script will be preserved and accessible by running
```
kubectl logs pod_name
```
Output from initContainer can be seen with
```
kubectl logs pod_name -c init-clone-repo
```
To see logs in real time do:
```
kubectl logs -f pod_name
```
The pod will remain in the Completed state until you delete it or the timeout passes.
#### Retries
The `backoffLimit` field specifies how many times your pod will be restarted if the exit status of your script is not 0
or the pod was terminated for another reason (for example, a node was rebooted). It's a good idea to set it greater than 0.
#### Fair queueing
There is no fair queue implemented on Nautilus. If you submit 1000 jobs, you block **all** other users from submitting in the cluster.
To limit your submission to a fair portion of the cluster, refer to [this guide](https://kubernetes.io/docs/tasks/job/fine-parallel-processing-work-queue/). Make sure to use a deployment and persistent storage for Redis pod. Here's [our example](https://gitlab.nrp-nautilus.io/prp/job-queue/-/blob/master/redis.yaml)
#### CPU only jobs
Nautilus is primarily used for GPU jobs. While it's possible to run large CPU-only jobs, you have to take certain measures to prevent taking over all cluster resources.
You can run the jobs with lower priority and allow other jobs to preempt yours. This way you should not worry about the size of your jobs and you can use the maximum number of resources in the cluster. To do that, add the `opportunistic` priority class to your pods:
```yaml
spec:
priorityClassName: opportunistic
```
Another option is to avoid the GPU nodes entirely. This way you can be sure you're only using the CPU-only nodes and your jobs are not preventing any GPU usage. To do this, add a node anti-affinity for the GPU device to your pod:
```yaml
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: feature.node.kubernetes.io/pci-10de.present
operator: NotIn
values:
- "true"
```
You can use either method on its own or combine the two.
---
title: Deployments
summary: "Deployments"
weight: 50
---
## Running an idle deployment
In case you need an idle pod in the cluster that might occasionally do some computations, you have to run it as a [Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/). Deployments in Nautilus are limited to 2 weeks (unless the namespace is added to the exceptions list and runs a permanent service). This ensures your pod will not run in the cluster forever once you no longer need it and have moved on to other projects.
Please don't run such pods as Jobs, since those are not purged by the cleaning daemon and will stay in the cluster forever if you forget to remove them.
Such a deployment **cannot request a GPU**. You can use the
```
command:
- sleep
- "100000000"
```
as the command if you just want a pure shell, and `busybox`, `centos`, `ubuntu` or any other general image you like.
Follow the [guide for creating deployments](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/) and add minimal requests and sensible limits, for example:
```
resources:
limits:
cpu: "1"
memory: 10Gi
requests:
cpu: "10m"
memory: 100Mi
```
Example of running an nginx deployment:
```
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-deployment
labels:
k8s-app: nginx
spec:
replicas: 1
selector:
matchLabels:
k8s-app: nginx
template:
metadata:
labels:
k8s-app: nginx
spec:
containers:
- image: nginx
name: nginx-pod
resources:
limits:
cpu: 1
memory: 4Gi
requests:
cpu: 100m
memory: 500Mi
```
## Quickly stopping and starting the pod
If you need a simple way to start and stop your pod without redeploying every time, you can scale down the deployment. This will leave the definition, but delete the pod.
To stop the pod, scale down:
```
kubectl scale deployment deployment-name --replicas=0
```
To start the pod, scale up:
```
kubectl scale deployment deployment-name --replicas=1
```
---
title: GPU Pods
summary: "GPU Pods"
weight: 20
---
The Nautilus Cluster provides over 200 GPU nodes. In this section you will request GPUs. Make sure you don't waste them, and delete your pods when you are not using the GPUs.
Use this definition to create your own pod and deploy it to kubernetes \(refer to [Basic Kubernetes](basic_kubernetes.md)\):
```yaml
apiVersion: v1
kind: Pod
metadata:
name: gpu-pod-example
spec:
containers:
- name: gpu-container
image: gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/prp:latest
command: ["sleep", "infinity"]
resources:
limits:
nvidia.com/gpu: 1
```
This example requests 1 GPU device. You can request up to 2 GPUs per pod. If you request GPU devices in your pod,
Kubernetes will automatically schedule your pod to an appropriate node. There's no need to specify the location manually.
**You should always delete your pod** when your computation is done to let other users use the GPUs.
Consider using [Jobs]({{ nrp.docs_url }}/userdocs/running/jobs/) **with actual script instead of `sleep`** whenever possible to ensure your pod is not wasting GPU time.
If you have never used Kubernetes before, see the [tutorial]({{ nrp.docs_url }}/userdocs/start/getting-started/).
#### Requesting high-demand GPUs
Certain kinds of GPUs have much higher specs than others, and to avoid wasting them on regular jobs, your pods will only be scheduled on these GPUs if you request the type explicitly.
Currently those include:
* NVIDIA-TITAN-RTX
* NVIDIA-RTX-A5000
* Quadro-RTX-6000
* Tesla-V100-SXM2-32GB
* NVIDIA-A40
* NVIDIA-RTX-A6000
* Quadro-RTX-8000
* NVIDIA-A100-SXM4-80GB*
*A100s running in [MIG mode](#mig-mode) are not considered high-demand.
#### Requesting many GPUs
Since 1 and 2 GPU jobs are blocking nodes from getting 4 and 8 GPU jobs, there are some nodes reserved for those. Once you submit a job requesting 4 or 8 GPUs, a controller will automatically add toleration which will allow you to use the node reserved for more GPUs. You don't need to do anything manually for that.
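For example, a sketch of the container resources for a 4-GPU pod (the toleration mentioned above is added automatically by the controller):

```yaml
resources:
  limits:
    nvidia.com/gpu: 4
```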
#### Choosing GPU type
We have a variety of GPU flavors attached to Nautilus. You can get a list of GPU models from the actual cluster information (e.g. `kubectl get nodes -L nvidia.com/gpu.product`).
<div id="observablehq-chart-35acf314"></div>
<p>Credit: <a href="https://observablehq.com/d/7c0f46855b4212e0">GPU types by NRP Nautilus</a></p>
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@observablehq/inspector@5/dist/inspector.css">
<script type="module">
import {Runtime, Inspector} from "https://cdn.jsdelivr.net/npm/@observablehq/runtime@5/dist/runtime.js";
import define from "https://api.observablehq.com/d/7c0f46855b4212e0.js?v=4";
new Runtime().module(define, name => {
if (name === "chart") return new Inspector(document.querySelector("#observablehq-chart-35acf314"));
});
</script>
If you need more graphics memory, use the official specs to choose the type. The table below is an example of the GPU types in the Nautilus Cluster and their memory size:
GPU Type | Memory size (GB)
---|---
NVIDIA-GeForce-GTX-1070 | 8G
NVIDIA-GeForce-GTX-1080 | 8G
Quadro-M4000 | 8G
NVIDIA-A100-PCIE-40GB-MIG-2g.10gb | 10G
NVIDIA-GeForce-GTX-1080-Ti | 12G
NVIDIA-GeForce-RTX-2080-Ti | 12G
NVIDIA-TITAN-Xp | 12G
Tesla-T4 | 16G
NVIDIA-A10 | 24G
NVIDIA-GeForce-RTX-3090 | 24G
NVIDIA-TITAN-RTX | 24G
NVIDIA-RTX-A5000 | 24G
Quadro-RTX-6000 | 24G
Tesla-V100-SXM2-32GB | 32G
NVIDIA-A40 | 48G
NVIDIA-RTX-A6000 | 48G
Quadro-RTX-8000 | 48G
**NOTE**: [Not all nodes are available to all users]({{ nrp.docs_url }}/userdocs/running/special/). You can consult about your available resources in [Matrix]({{ nrp.docs_url }}/userdocs/start/support) and on [resources page]({{ nrp.resources.url }}).
Labs connecting their hardware to our cluster have preferential access to all our resources.
To use a **specific type of GPU**, add the affinity definition to your pod yaml
file. The example below specifies *1080Ti* GPU:
```yaml
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.product
operator: In
values:
- NVIDIA-GeForce-GTX-1080-Ti
```
**To make sure you did everything correctly** after you've submitted the job, look at the corresponding pod yaml (`kubectl get pod ... -o yaml`) and check that resulting nodeAffinity is as expected.
#### Selecting CUDA version
In general, newer drivers support container images built with the same or a lower CUDA runtime version. The nodes are labelled with the major and minor CUDA and driver versions. You can check those on the [resources page]({{ nrp.resources.url }}) or list them with this command (it will also select only GPU nodes):
```bash
kubectl get nodes -L nvidia.com/cuda.driver.major,nvidia.com/cuda.driver.minor,nvidia.com/cuda.runtime.major,nvidia.com/cuda.runtime.minor -l nvidia.com/gpu.product
```
If you're using the container image with higher CUDA version, you have to pick the nodes supporting it. Example:
```yaml
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/cuda.runtime.major
operator: In
values:
- "12"
- key: nvidia.com/cuda.runtime.minor
operator: In
values:
- "2"
```
You can also require a driver version above a certain value if you know which one you need (this will pick drivers **above** 535):
```yaml
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/cuda.driver.major
operator: Gt
values:
- "535"
```
#### MIG mode
A100 GPUs can be sliced into several logical GPUs ([MIG mode](https://docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#a100-profiles)). This mode is enabled in our cluster. Things can change, but currently we are planning to slice them in halves. The current MIG mode can be obtained from the nodes via the `nvidia.com/gpu.product` label: `NVIDIA-A100-PCIE-40GB-MIG-2g.10gb` means 2 compute instances (out of 7 total) and 10GB of memory per virtual GPU.
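Following the node-affinity pattern shown earlier, a sketch of a pod spec that schedules only on the MIG-sliced A100s (the label value is the product string above):

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - NVIDIA-A100-PCIE-40GB-MIG-2g.10gb
```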