123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272 |
- This information is meant primarily for the Slurm developers.
- System administrators should read the instructions at
- http://slurm.schedmd.com/quickstart_admin.html
- (also found in the file doc/html/quickstart_admin.shtml).
- The "INSTALL" file contains generic Linux build instructions.
- Simple build/install on Linux:
- ./configure --enable-debug \
- --prefix=<install-dir> --sysconfdir=<config-dir>
- make
- make install
- To build the files in the contribs directory:
- make contrib
- make install-contrib
- (The RPMs are built by default)
- If you make changes to any auxdir/* or Makefile.am file, then run
- _snowflake_ (where there are recent versions of autoconf, automake
- and libtool installed):
- ./autogen.sh
- then check-in the new Makefile.am and Makefile.in files
- Here is a step-by-step HOWTO for creating a new release of Slurm on a
- Linux cluster (See BlueGene and AIX specific notes below for some differences).
- 0. Get current copies of Slurm and buildfarm
- > git clone https://<user_name>@github.com/chaos/slurm.git
- > svn co https://eris.llnl.gov/svn/chaos/private/buildfarm/trunk buildfarm
- place the buildfarm directory in your search path
- > export PATH=~/buildfarm:$PATH
- 1. Update NEWS and META files for the new release. In the META file,
- the API, Major, Minor, Micro, Version, and Release fields must all
- by up-to-date. **** DON'T UPDATE META UNTIL RIGHT BEFORE THE TAG ****
- The Release field should always be 1 unless one of
- the following is true
- - Changes were made to the spec file, documentation, or example
- files, but not to code.
- - this is a prerelease (Release = 0.preX)
- 2. Tag the repository with the appropriate name for the new version.
- Note the first three digits are the version number. For a proper release,
- the last digit is "1" (except for a rebuild without code changes which
- could be "2"). For pre-releases, the last digit should be "0" followed by
- "pre#" or "rc#".
- > git tag -a slurm-2-6-7-1 -m "create tag v2.6.7" OR
- > git tag -a slurm-2-7-0-0pre5 -m "create tag v2.7.0-pre5"
- > git push --tags
- 3. Use the rpm make target to create the new RPMs. This requires a .rpmmacros
- (.rpmrc for newer versions of rpmbuild) file containing:
- %_slurm_sysconfdir /etc/slurm
- %_with_debug 1
- %_with_sgijob 1
- NOTE: build will make a tar-ball based upon ALL of the files in your current
- local directory. If that includes scratch files, everyone will get those
- files in the tar-ball. For that reason, it is a good idea to clone a clean
- copy of the repository and build from that
- > git clone https://<user_name>@github.com/chaos/slurm.git <local_dir>
- Build using the following syntax:
- > build --snapshot -s <local_dir> OR
- > build --nosnapshot -s <local_dir>
- --nosnapshot will name the tar-ball and RPMs based upon the META file
- --snapshot will name the tar-ball and RPMs based upon the META file plus a
- timestamp. Do this to make a tar-ball for a non-tagged release.
- NOTE: <local_dir> should be a fully-qualified pathname
- 4. scp the files to schedmd.com in to ~/www/download/latest or
- ~/www/download/development. Move the older files to ~/www/download/archive,
- login to schedmd.com, cd to ~/download, and execute "php process.php" to
- update the web pages.
- BlueGene build notes:
- 0. If on a bgp system and you want sview export these variables
- > export CFLAGS="-I/opt/gnome/lib/gtk-2.0/include -I/opt/gnome/lib/glib-2.0/include $CFLAGS"
- > export LIBS="-L/usr/X11R6/lib64 $LIBS"
- > export CMD_LDFLAGS='-L/usr/X11R6/lib64'
- > export PKG_CONFIG_PATH="/opt/gnome/lib64/pkgconfig/:$PKG_CONFIG_PATH"
- 1. Use the rpm make target to create the new RPMs. This requires a .rpmmacros
- (.rpmrc for newer versions of rpmbuild) file containing:
- %_prefix /usr
- %_slurm_sysconfdir /etc/slurm
- %_with_bluegene 1
- %_without_pam 1
- %_with_debug 1
- Build on Service Node with using the following syntax
- > rpmbuild -ta slurm-...bz2
- The RPM files get written to the directory
- /usr/src/packages/RPMS/ppc64
- To build and run on AIX:
- 0. Get current copies of Slurm and buildfarm
- > git clone https://<user_name>@github.com/chaos/slurm.git
- > svn co https://eris.llnl.gov/svn/chaos/private/buildfarm/trunk buildfarm
- put the buildfarm directory in your search path
- > export PATH=~/buildfarm:$PATH
- Put the buildfarm directory in your search path
- Also, you will need several commands to appear FIRST in your PATH:
- /usr/local/tools/gnu/aix_5_64_fed/bin/install
- /usr/local/gnu/bin/tar
- /usr/bin/gcc
- I do this by making symlinks to those commands in the buildfarm directory,
- then making the buildfarm directory the first one in my PATH.
- Also, make certain that the "proctrack" rpm is installed.
- 1. Export some environment variables
- > export OBJECT_MODE=32
- > export PKG_CONFIG="/usr/bin/pkg-config"
- 2. Build with:
- > ./configure --enable-debug --prefix=/opt/freeware \
- --sysconfdir=/opt/freeware/etc/slurm \
- --with-ssl=/opt/freeware --with-munge=/opt/freeware \
- --with-proctrack=/opt/freeware
- make
- make uninstall # remove old shared libraries, aix caches them
- make install
- 3. To build RPMs (NOTE: GNU tools early in PATH as described above in #0):
- Create a .rpmmacros file specifying system specific files:
- #
- # RPM Macros for use with Slurm on AIX
- # The system-wide macros for RPM are in /usr/lib/rpm/macros
- # and this overrides a few of them
- #
- %_prefix /opt/freeware
- %_slurm_sysconfdir %{_prefix}/etc/slurm
- %_defaultdocdir %{_prefix}/doc
- %_with_debug 1
- %_with_aix 1
- %with_ssl "--with-ssl=/opt/freeware"
- %with_munge "--with-munge=/opt/freeware"
- %with_proctrack "--with-proctrack=/opt/freeware"
- Log in to the machine "uP". uP is currently the lowest-common-denominator
- AIX machine.
- NOTE: build will make a tar-ball based upon ALL of the files in your current
- local directory. If that includes scratch files, everyone will get those
- files in the tar-ball. For that reason, it is a good idea to clone a clean
- copy of the repository and build from that
- > git clone https://<user_name>@github.com/chaos/slurm.git <local_dir>
- Build using the following syntax:
- > export CC=/usr/bin/gcc
- > build --snapshot -s <local_dir> OR
- > build --nosnapshot -s <local_dir>
- --nosnapshot will name the tar-ball and RPMs based upon the META file
- --snapshot will name the tar-ball and RPMs based upon the META file plus a
- timestamp. Do this to make a tar-ball for a non-tagged release.
- 4. Test POE after telling POE where to find Slurm's LoadLeveler wrapper.
- > export MP_RMLIB=./slurm_ll_api.so
- > export CHECKPOINT=yes
- 5. > poe hostname -rmpool debug
- 6. To debug, set SLURM_LL_API_DEBUG=3 before running poe - will create a file
- /tmp/slurm.*
- It can also be helpful to use poe options "-ilevel 6 -pmdlog yes"
- There will be a log file create named /tmp/mplog.<jobid>.<taskid>
- 7. If you update proctrack, be sure to run "slibclean" to clear cached
- version.
- 8. Remove the RPMs that we don't want:
- rm -f slurm-perlapi*rpm slurm-torque*rpm
- and install the other RPMs into /usr/admin/inst.images/slurm/aix5.3 on an
- OCF AIX machine (pdev is a good choice).
- Debian build notes:
- Since Debian doesn't have PRMs, the rpmbuild program can not locate
- dependencies, so build without them by patching the build program:
- Index: build
- ===================================================================
- --- build (revision 173)
- +++ build (working copy)
- @@ -798,6 +798,7 @@
- $cmd .= " --define \"_tmppath $rpmdir/TMP\"";
- $cmd .= " --define \"_topdir $rpmdir\"";
- $cmd .= " --define \"build_bin_rpm 1\"";
- + $cmd .= " --nodeps";
- if (defined $conf{rpm_dist}) {
- my $dist = length $conf{rpm_dist} ? $conf{rpm_dist} : "%{nil}";
- $cmd .= " --define \"dist $dist\"";
- AIX/Federation switch window problems
- To clean switch windows: ntblclean =w 8 -a sni0
- To get switch window status: ntblstatus
- BlueGene bglblock boot problem diagnosis
- - Logon to the Service Node (bglsn, ubglsn)
- - Execute /admin/bglscripts/fatalras
- This will produce a list of failures including Rack and Midplane number
- <date> R<rack> M<midplane> <failure details>
- - Translate the Rack and Midplane to Slurm node id: smap -R r<rack><midplane>
- - Drain only the bad Slurm node, return others to service using scontrol
- Configuration file update procedures:
- - cd /usr/bgl/dist/slurm (on bgli)
- - co -l <filename>
- - vi <filename>
- - ci -u <filename>
- - make install
- - then run "dist_local slurm" on SN and FENs to update /etc/slurm
- Some RPM commands:
- rpm -qa | grep slurm (determine what is installed)
- rpm -qpl slurm-1.1.9-1.rpm (check contents of an rpm)
- rpm -e slurm-1.1.8-1 (erase an rpm)
- rpm --upgrade slurm-1.1.9-1.rpm (replace existing rpm with new version)
- rpm -i --ignoresize slurm-1.1.9-1.rpm (install a new rpm)
- For main Slurm plugin installation on BGL service node:
- rpm -i --force --nodeps --ignoresize slurm-1.1.9-1.rpm
- rpm -U --force --nodeps --ignoresize slurm-1.1.9-1.rpm (upgrade option)
- To clear a wedged job:
- /bgl/startMMCSconsole
- > delete bgljob ####
- > free RMP###
- Starting and stopping daemons on Linux:
- /etc/init.d/slurm stop
- /etc/init.d/slurm start
- Patches:
- - cd to the top level src directory
- - Run the patch command with epilog_complete.patch as stdin:
- patch -p[path_level_to_filter] [--dry-run] < epilog_complete.patch
- To get the process and job IDs with proctrack/sgi_job:
- - jstat -p
- CVS and gnats:
- Include "gnats:<id> e.g. "(gnats:123)" as part of cvs commit to
- automatically record that update in gnats database. NOTE: Does
- not change gnats bug state, but records source files associated
- with the bug.
- For memory leaks (for AIX use zerofault, zf; for linux use valgrind)
- - Run configure with the option "--enable-memory-leak-debug" to completely
- release allocated memory when the daemons exit
- - valgrind --tool=memcheck --leak-check=yes --num-callers=8 --leak-resolution=med \
- ./slurmctld -Dc >valg.ctld.out 2>&1
- - valgrind --tool=memcheck --leak-check=yes --num-callers=8 --leak-resolution=med \
- ./slurmd -Dc >valg.slurmd.out 2>&1 (Probably only one one node of cluster)
- - valgrind --tool=memcheck --leak-check=yes --num-callers=8 --leak-resolution=med \
- ./slurmdbd -D >valg.dbd.out 2>&1
- - Run the regression test. In the globals.local file include:
- "set enable_memory_leak_debug 1"
- - Shutdown the daemons using "scontrol shutdown"
- - Examine the end of the log files for leaks. pthread_create() and dlopen()
- have small memory leaks on some systems, which do not grow over time
- - Functions in the plugins will not have their symbols preserved when the
- plugin is unloaded and the function names will appear as "???" in the
- valgrind backtrace after shutdown. Rebuilding the daemons without the
- configure option of "--enable-memory-leak-debug" typically prevents the
- plugin from being unloaded so the symbols will be properly reported. However
- many memory leaks will be reported due to not unloading plugins. You will
- need to match the call sequence from the first log (with
- "--enable-memory-leak-debug") to the second log (without
- "--enable-memory-leak-debug" and ignore memory leaks reported that are not
- real leaks in the second log) to identify the full code path through plugins.
- Job profiling:
- - "export CFLAGS=-pg", then run "configure" and "make install" as usual.
- - Run the slurm daemons through a stress test and exit normally
- - Run "gprof [executable-file] >outfile"
- Before new major release:
- - Test on ia64, i386, x86_64, BGQ, NRT/POE, Cray
- - Test on Elan and IB switches
- - Test fail-over of slurmctld
- - Test for memory leaks in slurmctld, slurmd and slurmdbd with various plugins
- - Change API version number
- - Run "make check" (requires "dejagnu" package)
- - Test that the prolog and epilog run
- - Run the test suite with SlurmUser NOT being self
- - Test for errors reported by CLANG tool:
- NOTE: Run "configure" with "--enable-developer" option so assert functions
- take effect.
- scan-build -k -v make >m.sb.out 2>&1
- # and look for output in /tmp/scan-build-<DATE>
|