Excessive disk use in Rosetta download and build processes

The software I build and install gets pushed to standalone installations on about 100 different machines. Each installation usually lives on higher-cost storage (e.g. RAID or a SAN), and all of those installations are backed up somewhere or other, so I usually take the time to weed out as many unnecessary files as possible. It makes everyone's life a little better.

Looking at the Rosetta packaging, there are a number of opportunities for improvement. The first one is very easy. Here's the disk usage (in KB) of an expanded rosetta3.2_bundles.tgz, before and after removing the .svn directories:

$ du -k --max-depth=1 | sort -n
48 ./.svn
372 ./BioTools
1896 ./manual
70724 ./rosetta_demos
90188 ./foldit
556332 ./rosetta_source
738672 ./rosetta_database
1406972 ./rosetta_fragments
2865216 .
$ find . -type d -name '\.svn' | xargs rm -rf
$ du -k --max-depth=1 | sort -n
148 ./BioTools
1692 ./manual
34376 ./rosetta_demos
90192 ./foldit
269804 ./rosetta_source
367252 ./rosetta_database
703292 ./rosetta_fragments
1466768 .

So about half of the disk usage in your tarball (1,398,448 KB of 2,865,216 KB, roughly 49%) is taken up by files that the end user will never use. It also costs more bandwidth on the download for both you and your users. Is there some reason you're leaving those in the download?
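For anyone repeating this cleanup on trees whose paths may contain spaces, here is a minimal sketch of a slightly more defensive variant of the find command above. The function name `prune_svn` is made up for illustration; it relies only on the standard find options -prune and -exec ... {} +.

```shell
# Hypothetical helper: remove every .svn directory under a tree
# (defaults to the current directory).
# -prune keeps find from descending into a directory it is about to delete;
# -exec ... {} + passes the names directly to rm, so spaces in paths are safe.
prune_svn() {
    find "${1:-.}" -type d -name .svn -prune -exec rm -rf {} +
}
```

Running `du -k --max-depth=1 | sort -n` before and after `prune_svn .` gives the same kind of comparison as shown above.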

I built a static version of Rosetta using the Intel Compilers for linux. Yes, I know a dynamic version would use less disk space, but I support many different linux distributions. I used this command line:

./scons.py bin mode=release cxx=icc extras=static

This works fine to create static binaries with the Intel compilers, but the resulting binaries don't have their debug symbols stripped:

$ file minirosetta.linuxiccrelease
minirosetta.linuxiccrelease: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, statically linked, for GNU/Linux 2.2.5, not stripped

Is the mode=release target not working for the Intel compilers? Why are the debug symbols still there?

Using extras=static also creates an identical binary with the word "static" in its name:

$ md5sum minirosetta.*
39c344f6a6b45f1280a3fc7fe80d5c70 minirosetta.linuxiccrelease
39c344f6a6b45f1280a3fc7fe80d5c70 minirosetta.static.linuxiccrelease

That seems superfluous since the path to the binaries includes the word static. Actually the way that every binary is suffixed with the operating system, compiler and build target seems superfluous given that information is all contained in the path to the binaries. It also invalidates all your documentation from the perspective of non-technical users trying to cut and paste example commands from the manual.

From a disk use perspective:

$ du -sk rosetta_source/build/src/release/linux/2.6/32/x86/icc/static/
4336784 rosetta_source/build/src/release/linux/2.6/32/x86/icc/static/
$ rm rosetta_source/build/src/release/linux/2.6/32/x86/icc/static/*.static.*
$ strip rosetta_source/build/src/release/linux/2.6/32/x86/icc/static/*.linuxiccrelease
$ du -sk rosetta_source/build/src/release/linux/2.6/32/x86/icc/static/
2016292 rosetta_source/build/src/release/linux/2.6/32/x86/icc/static/

So that's a disk saving of more than 50% (about 53%) from a few simple steps.
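A more cautious take on the `rm` step above is to delete a `*.static.*` file only when it really is byte-identical to its twin, in case that changes in a later release. This is only a sketch: the function name `remove_identical_static` is made up, and it assumes the twin's name is the same file name with the first ".static" removed, as in the md5sum listing above.

```shell
# Hypothetical helper: in directory $1, delete each *.static.* file only if
# it is byte-identical to the same-named file without ".static".
remove_identical_static() {
    dir=$1
    for f in "$dir"/*.static.*; do
        [ -f "$f" ] || continue                      # glob matched nothing
        name=$(basename "$f")
        # Assumed naming: minirosetta.static.X <-> minirosetta.X
        twin="$dir/$(printf '%s\n' "$name" | sed 's/\.static\././')"
        if [ -f "$twin" ] && cmp -s "$f" "$twin"; then
            rm -- "$f"                               # safe: twin has same bytes
        fi
    done
}
```

After that, `strip` can be run on the remaining `*.linuxiccrelease` files exactly as in the session above.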

Tue, 2011-02-01 11:44
bene

None of the developer labs are limited by disk space, so it's not really a consideration for us. If disk space and bandwidth became restrictively expensive, we'd optimize for it. As it is, developer time is vastly more expensive than anything else, so whatever makes development easier is what gets done. (Also, none of us use the release for anything, we use the development version.)

A) We haven't included the subversion files in the past. Assuming we included them on purpose this time, it's probably so users can revert changes locally instead of redownloading the code. It sure is convenient to SVN revert mistakes... (Um, it's also possible it was just an oversight, I don't know, I don't do the release.)

B) Probably 50% or more of the stuff in the rosetta_source directory is junk beyond the svn files you removed. There's a huge amount of testing data integrated into the codebase. Some of us have been trying to get it moved out, but there's institutional inertia. Where are you that you're administering Rosetta across multiple platforms? Let me know if you want help removing some of that extra data to slim your Rosetta installs down and I'll be happy to point it out. (Big hint: everything in the test directory can probably go if you are managing it for the end users.)

C) I don't know enough about SCons to address it but I'll point out your concern to someone who does.

Tue, 2011-02-01 12:19
smlewis

Also, for the debug symbols in the intel/static/release build: yes, you're probably right, SCons is probably screwed up. We're SCons clients, not developers, so our use of it is certainly subtly wrong in many ways. If you can fix this I'll port the fix back to trunk.

You may be able to fix it by comparing the gcc/static/release to the intel/static/release options in tools/build/basic.settings.

Tue, 2011-02-01 12:23
smlewis

Hi bene,

Thanks for your careful investigation of the size of Rosetta.

It was an oversight to not strip out the .svn directories--which we'll take care of.

As for building two copies of the executables, this is a known bug that should be fixed soon. For now feel free to delete the extra copies.

As for stripping the symbols, we keep them in so users can get backtraces in the debugger should they run into a problem and want to try to figure out what is going on. If you are concerned about the executable size, the best thing to do would be to call 'strip' on each executable.

Best of luck,
Matt

Tue, 2011-02-01 16:58
momeara

Thanks for your comments and your offer of help, smlewis.

Thu, 2011-02-03 09:01
bene

Here's a list of things that aren't required across all installations (you should keep them on the master installation). Deleting these might pare it down by half; it won't affect the size of the compiled code (which is waaaaay bigger than the uncompiled code anyway...)

foldit
manual (consider your users, might want to keep)
rosetta_demos (consider your users, might want to keep)
rosetta_fragments

in rosetta_source:
test (310 MB! more than src! ugh!)
stubs
analysis

in rosetta_source/src:
rdwizard
integration
python (probably)

in rosetta_source/src/apps/
benchmark
curated
pilot
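As a rough sketch, the list above could be turned into a small shell function for the pushed copies. The name `slim_rosetta` and its single argument (the root of the expanded bundle) are made up for illustration, and the paths assume the directory layout shown earlier in this thread. The items marked as optional (manual, rosetta_demos, src/python) are deliberately left alone. Presumably this should only be run on a copy, after the build has finished, never on the master installation.

```shell
# Hypothetical helper: remove the directories from the list above from the
# expanded bundle rooted at $1. Optional items are intentionally not listed.
slim_rosetta() {
    root=$1
    for rel in \
        foldit \
        rosetta_fragments \
        rosetta_source/test \
        rosetta_source/stubs \
        rosetta_source/analysis \
        rosetta_source/src/rdwizard \
        rosetta_source/src/integration \
        rosetta_source/src/apps/benchmark \
        rosetta_source/src/apps/curated \
        rosetta_source/src/apps/pilot
    do
        # Skip anything that is not present in this copy.
        if [ -d "$root/$rel" ]; then
            printf 'removing %s\n' "$root/$rel"
            rm -rf -- "$root/$rel"
        fi
    done
}
```

Calling `slim_rosetta /path/to/copy` prints each directory as it is removed, which makes it easy to check the result against `du -k --max-depth=1`.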

Fri, 2011-02-04 06:59
smlewis

Just a "me too" for reducing the release footprint of Rosetta.

Wed, 2011-04-06 08:06
tru

The svn files were removed for 3.2.1, just so everybody knows.

Wed, 2011-04-06 17:37
smlewis