Unit test compilation failure on Rosetta 3.5+ with Intel icpc


Hello,

I'm attempting a complete build with intel/14.0.1.106 + mvapich2/2.0b compiler and MPI libraries. I'd like to be able to compile and run the unit tests as well as the integration tests to serve as a base level benchmark for performance improvements on our system.

#### Part of my configuration file ############################
cat > site.settings.stampede <<EOF
...snip
import os

settings = {
    "site" : {
        "prepends" : {
            # Location of standard and system binaries
            "program_path" : os.environ["PATH"].split(":"),
            "library_path" : os.environ["LD_LIBRARY_PATH"].split(":"),
            "include_path" : os.environ["INCLUDE"].split(":"),
        },
        "appends" : {
            "flags" : {
                "compile" : ["mkl"],
                "warn" : [ "wd1684", "wd592" ]
            },
        },
        "overrides" : {
            "cxx" : "mpicxx",
            "cc" : "mpicc",
        },
        "removes" : {
        },
    }
}
EOF
rm site.settings
ln -s site.settings.stampede site.settings
cd ../../

COMPILER=icc

echo -e "LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"
echo -e "PATH=${PATH}"

MODE=debug
EXTRAS=mpi

./scons.py -c
rm .sconsign.dblite
./scons.py -j16 mode=${MODE} extras=${EXTRAS} cxx=${COMPILER} bin
./scons.py -j16 mode=${MODE} extras=${EXTRAS} cxx=${COMPILER} cat=test

###### End of relevant part of my configuration file #######################

The code build seems to complete without error.
The unit test build errors before completion.

############ Unit test compilation error ##############################

mpicxx -o build/test/debug/linux/2.6/64/x86/icc/14.0/mpi/core/pack/interaction_graph/InteractionGraphFactory.cxxtest.o -c -Wp64 -wd279,2259,1682 -O0 -g -mkl -wd1684 -wd592 -DBOOST_ERROR_CODE_HEADER_ONLY -DBOOST_SYSTEM_NO_DEPRECATED -DUSEMPI -Iexternal/cxxtest -I. -Itest -Isrc -Iexternal/include -Isrc/platform/linux/64/icc/14.0 -Isrc/platform/linux/64/icc -Isrc/platform/linux/64 -Isrc/platform/linux -I/opt/apps/intel/13/composer_xe_2013_sp1.1.106/ipp/include -I/opt/apps/intel/13/composer_xe_2013_sp1.1.106/mkl/include -Iexternal/boost_1_55_0 -Iexternal/dbio -I/usr/include -I/usr/local/include build/test/debug/linux/2.6/64/x86/icc/14.0/mpi/core/pack/interaction_graph/InteractionGraphFactory.cxxtest.cpp

src/core/pack/interaction_graph/SurfaceInteractionGraph.hh(1957): error: incomplete type is not allowed
core::scoring::TenANeighborGraph const & tenA_neighbor_graph( get_surface_owner()->pose().energies().tenA_neighbor_graph() );
^
detected during:
instantiation of "void core::pack::interaction_graph::SurfaceNode<V, E, G>::initialize_num_neighbors_counting_self() const [with V=core::pack::interaction_graph::LinearMemNode, E=core::pack::interaction_graph::LinearMemEdge, G=core::pack::interaction_graph::LinearMemoryInteractionGraph]" at line 1981
instantiation of "int core::pack::interaction_graph::SurfaceNode<V, E, G>::num_neighbors_counting_self() const [with V=core::pack::interaction_graph::LinearMemNode, E=core::pack::interaction_graph::LinearMemEdge, G=core::pack::interaction_graph::LinearMemoryInteractionGraph]" at line 3330
instantiation of "void core::pack::interaction_graph::SurfaceInteractionGraph<V, E, G>::initialize(const core::pack::rotamer_set::RotamerSetsBase &) [with V=core::pack::interaction_graph::LinearMemNode, E=core::pack::interaction_graph::LinearMemEdge, G=core::pack::interaction_graph::LinearMemoryInteractionGraph]"

############### End unit test compilation error ###################################

I have attempted builds with other versions of the Intel compiler and also other SCons settings for Rosetta (omp,mpi,default,release,debug,etc.) and the compilation consistently errors on src/core/pack/interaction_graph/SurfaceInteractionGraph.hh.

Before descension into template hell, I am hoping that you all can help to get this cleared up.
I'll withhold my thoughts as to what's happening so as to not cloud judgement from you all.

I have tried it with GCC. It compiles without error. I'm looking to use specific features that tie in the Intel software stack with Intel specific hardware. So, the option of "just use GCC" is not appropriate. I have tried Rosetta 3.5 as well as the rosetta_2014.35.57232_bundle. Both fail in the same manner.

Let me know if there is more information I can provide that is relevant.
Thanks,
Cyrus

Mon, 2014-09-29 11:52
sidio47

Keep in mind that the unit tests are just for testing that things are working correctly - they're not needed for actually running Rosetta for scientific purposes.

I'm not sure why things are erroring out. I was thinking there was some implicit header dependency that gcc was picking up that icc isn't, but I can't figure out what it is.

Honestly, if I were in your situation, I would probably just comment out lines 27 and 161-200 in test/core/pack/interaction_graph/InteractionGraphFactory.cxxtest.hh. (These are the SurfaceInteractionGraph include and the test_LinearMemorySurfaceInteractionGraph() and test_PDSurfaceInteractionGraph() functions.)

The SurfaceInteractionGraph is a somewhat specialized usage, and it's unlikely that you're going to be using it (you would only use it if the "surface" score term is non-zero in your energy function, which it isn't in the common scorefunctions), so it doesn't matter too much whether it tests cleanly or not, and commenting out those lines should avoid the compilation issues.

Mon, 2014-09-29 12:26
rmoretti

Thanks for the quick response. Please understand this comment is being made from outside the developer circle and is meant with all due respect for all the efforts that you all put forth.

I understand that I can comment out specific unit tests on parts of the code that are "rarely" used. This build ultimately is not for me. It's for all of the users of our computing center so I cannot begin to know what their needs may be. I would certainly feel more confident in the results of this scientific code if the unit tests that ship with release are indeed capable of compiling and producing expected results.

To "meet you halfway" so to speak, I commented out the sections you refer to. I also commented out lines 28-40 of the generated file build/test/debug/linux/2.6/64/x86/icc/14.0/mpi/core/pack/interaction_graph/InteractionGraphFactory.cxxtest.cpp so that the tests don't run. When I do this, I open another can of worms with an error similar to the one before:

################# Next Error ########################

mpicxx -o build/test/debug/linux/2.6/64/x86/icc/14.0/mpi/core/pack/interaction_graph/InteractionGraphFactory.cxxtest.o -c -Wp64 -wd279,2259,1682 -O0 -g -mkl -wd1684 -wd592 -DUSEMPI -Iexternal/cxxtest -I. -Itest -Isrc -Iexternal/include -Isrc/platform/linux/64/icc/14.0 -Isrc/platform/linux/64/icc -Isrc/platform/linux/64 -Isrc/platform/linux -I/opt/apps/intel/13/composer_xe_2013_sp1.1.106/ipp/include -I/opt/apps/intel/13/composer_xe_2013_sp1.1.106/mkl/include -Iexternal/boost_1_46_1 -Iexternal/dbio -I/usr/include -I/usr/local/include build/test/debug/linux/2.6/64/x86/icc/14.0/mpi/core/pack/interaction_graph/InteractionGraphFactory.cxxtest.cpp
src/utility/pointer/owning_ptr.functions.hh(43): error: pointer to incomplete class type is not allowed
p->remove_ref();
^
detected during:
instantiation of "void utility::pointer::owning_ptr_release(T *) [with T=core::graph::Graph]" at line 131 of "src/utility/pointer/owning_ptr.hh"
instantiation of "utility::pointer::owning_ptr<T>::~owning_ptr() [with T=core::graph::Graph]" at line 72 of "./test/core/pack/interaction_graph/InteractionGraphFactory.cxxtest.hh"

################## End of Next Error ################################

This one may be simpler to diagnose, by the looks of src/utility/pointer/owning_ptr.functions.hh.

Ultimately, I would rather not be going in and manually patching the code. If there are specifics that allow workarounds, I'm happy to help explore those options. In general though, trying to build/maintain/support a code that compiles and produces a starting baseline is infinitely easier for everyone.

Mon, 2014-09-29 13:50
sidio47

Yeah, I get where you're coming from.

Unfortunately, I don't quite understand why you're seeing the errors you are, especially when you saw that the main library built perfectly fine.

I don't have much experience with the Intel compiler, and I asked around and no one in the Rosetta developer community came forward saying that they've built the unit tests with the Intel compiler, so we have limited experience on our side with this. In your first message you hinted that you had ideas as to what might be going on. What were they? As you know the quirks of your compiler version better than I do, fronting a suggestion could spark ideas from the Rosetta end on how to address the issues.

On the shot-in-the-dark end, one thing you may want to try if you haven't already is to do a clean compile. Either delete the contents of the build/ directory and try recompiling (both main library and unit tests), or start from scratch with the original source tarball. I don't know if something happened to your build that would provoke this. (It sounds a bit doubtful to me, too, but you never know ...)

Alternatively, as the main library and applications apparently built fine, you could just forgo building the unit tests altogether and just deploy without unit testing, assuming everything will be fine. To be honest, my understanding is that most users go this route. The unit tests are mainly there to test Rosetta during development, to ensure that the many people who are developing Rosetta don't inadvertently break some other part. We don't ship releases (weekly or otherwise) with broken unit tests, and a correctly-compiled version of Rosetta wouldn't change the results of the tests.

An alternate way of testing things is to run the integration tests. Go to main/tests/integration/ and run ./integration.py. (Run with just -h to see the options - there are --mode, --compiler, and --extras flags to control which version of the executables is being used.) This is set up to be run as a before/after comparison when developing, but you can run it as a one-off, and then look for files like ".test_did_not_run.log" or ".test_got_timeout_kill.log" under the ref/ or new/ directories to locate tests which did not run properly. If you don't see any of them, things are probably running okay.

Mon, 2014-10-06 09:53
rmoretti

Also, try to compile the unit tests without MPI. Typically, the unit tests are compiled and run using normal debug mode with the script test/run.py, run from the source directory. So, you would want a non-MPI debug build of the source and then the non-MPI compilation of the tests. You can specify -j to increase the number of processors used to run the tests; the other options are listed by ./test/run.py --help. I don't know if this would fix your compile errors, but it may be worth a shot.

Tue, 2014-10-07 11:20

Thanks for the suggestions. I tried a clean install and also compiling with MPI disabled for the unit tests. The errors remain.

To address rmoretti's points: Although I wish that we could rely on different compilers to give the same answers, we simply can't and it would be naive to assume this, especially on different machine architectures. Each compiler group decides and interprets the language standards differently and while, as a whole, we should expect consistent behavior, on some of the more esoteric decisions, we can end up with nuanced interpretations with very different behavior. Particularly if you all don't develop or test with the Intel compiler, then unit testing is critical to having any sort of confidence that the scientific code is going to produce the results you all intended.

To address rmoretti's question about my 'suspicion', I believe a reasonably similar example is given in this Stack Overflow conversation: https://stackoverflow.com/questions/12860199/error-pointer-to-incomplete...
Disclaimer: Please keep in mind that I am unfamiliar with the Rosetta code and that this interpretation may lead to a very wrong conclusion about the root of the issue.
In essence, I'm guessing that the Intel compiler is treating the pointer to get_surface_owner()->pose() more strictly than GCC. What I mean is that, this call to get_surface_owner()->pose().energies().tenA_neighbor_graph() seems to be instantiating a TenANeighborGraph graph::Graph*. The compiler is telling us that it cannot correctly determine if the pose() member function is going to be available. As the post suggests, there could be several plausible scenarios with issues relating to include guards, forward declarations, namespace issues, no declarations or multiple declarations. From perusing the source code and seeing some of the colorful language (src/core/pose/Pose.hh comes to mind) with respect to adhering to proper include strategies, this seems a likely avenue to explore.

The author of src/core/pack/interaction_graph/SurfaceInteractionGraph.hh is Ron Jacak and he may be the best person to contact with regards to this next bit.
For lines 1934 - 1955 and 2533 - 2554, I see commented-out sections of code that compute num_neighbors_counting_self_, which is the initializer that finds the number of atoms within some radius. It appears that the TenANeighborGraph can provide this information, so he is making a call to it instead -- thus hopefully simplifying the code and not recomputing what is already known. There is also specifically a debug statement attached to this section, which leads me to believe that this section has recently had an overhaul and the developers are still determining if it indeed has the correct behavior. I can uncomment the mentioned sections and comment out the TenANeighborGraph calls instead and have a successful compilation of the unit tests -- because pose.energies is not being called. Now, I don't know if this commented-out code will produce the answer it is supposed to -- I have to assume that because it is commented out the answer is conservatively no. Moving on, the results of the unit tests with Intel are:
-------- Unit test summary --------
Total number of tests: 1201
number tests passed: 1196
number tests failed: 5
failed tests:
core.test: SurfaceInteractionGraphTests:test_consider_substitution
core.test: SurfaceInteractionGraphTests:test_bg_node_2_resid
core.test: SurfaceInteractionGraphTests:test_commit_substitution
core.test: SurfaceInteractionGraphTests:test_blanket_reset_alt_state_counts
core.test: SurfaceInteractionGraphTests:test_get_energy_current_state_assignment
Success rate: 99%
---------- End of Unit test summary
Done!

Which tells me that the old commented out code, while it compiles, is producing incorrect answers.
I would appreciate some feedback about how to properly count the number of neighbor atoms without using an incomplete class definition.

Tue, 2014-10-21 12:34
sidio47

One way you could narrow down the issue a little is by breaking out the line in question, so the different objects are on different lines. Try replacing it with something like:

SurfaceInteractionGraph< V, E, G >* surf_owner_ptr( get_surface_owner() );
assert( surf_owner_ptr );
SurfaceInteractionGraph< V, E, G > const & surf_owner( *surf_owner_ptr );
core::pose::Pose const & debug_pose( surf_owner.pose() );
core::scoring::Energies const & debug_energies( debug_pose.energies() );
core::scoring::TenANeighborGraph const & tenA_neighbor_graph( debug_energies.tenA_neighbor_graph() );

If it's one of the intermediate objects which is incompletely specified, then we should get an error on another line of the code. (One would hope that the little caret would point to the incomplete class/object in question, but it looks like that might not be the case for the Intel compiler.)

As it turns out, when I try it with gcc, I get an error about an incomplete class type for core::scoring::Energies, even though the all-on-one-line version works fine. Adding "#include <core/scoring/Energies.hh>" to the top of SurfaceInteractionGraph.hh fixes the error. (Why the all-on-one-line version works, I have no idea.)

Please let me know if it fixes your issue, or if not, which line of the exploded statement the Intel compiler is choking on now.

Wed, 2014-10-22 16:10
rmoretti

Substituting in the intermediate steps as rmoretti suggested above gives:
###########################
mpicxx -o build/test/debug/linux/2.6/64/x86/icc/13.1/default/core/pack/interaction_graph/InteractionGraphFactory.cxxtest.o -c -Wp64 -wd279,2259,1682 -O0 -g -mkl -wd1684 -wd592 -Iexternal/cxxtest -I. -Itest -Isrc -Iexternal/include -Isrc/platform/linux/64/icc/13.1 -Isrc/platform/linux/64/icc -Isrc/platform/linux/64 -Isrc/platform/linux -I/opt/apps/intel/13/composer_xe_2013.2.146/ipp/include -I/opt/apps/intel/13/composer_xe_2013.2.146/mkl/include -Iexternal/boost_1_46_1 -Iexternal/dbio -I/usr/include -I/usr/local/include build/test/debug/linux/2.6/64/x86/icc/13.1/default/core/pack/interaction_graph/InteractionGraphFactory.cxxtest.cpp
src/core/pack/interaction_graph/SurfaceInteractionGraph.hh(1963): error: incomplete type is not allowed
core::scoring::TenANeighborGraph const & tenA_neighbor_graph( debug_energies.tenA_neighbor_graph() );
^
detected during:
instantiation of "int core::pack::interaction_graph::SurfaceNode<V, E, G>::num_neighbors_counting_self() const [with V=core::pack::interaction_graph::LinearMemNode, E=core::pack::interaction_graph::LinearMemEdge, G=core::pack::interaction_graph::LinearMemoryInteractionGraph]" at line 3340
###########################
Note that the caret "^" symbol lies below debug_energies. Its placement is likely an artifact of the forum rather than of the compiler: my whitespace is removed, and even when I try to put it back in, I think the non-monospaced font doesn't allow the caret to line up properly.

Ultimately, rmoretti nailed it. The #include <core/scoring/Energies.hh> solves both cases. Moreover, from the unit tests:
-------- Unit test summary --------
Total number of tests: 1201
number tests passed: 1201
number tests failed: 0
Success rate: 100%
---------- End of Unit test summary

So, for completeness:
Rosetta-3.5 compiles and passes all unit tests using compilers intel/14.0.1.106 and intel/13.0.2.146 with the addition of "#include <core/scoring/Energies.hh>" to rosetta_source/src/core/pack/interaction_graph/SurfaceInteractionGraph.hh.

Perhaps this would be a good addition to future releases?

Thanks to all of you that contributed, I appreciate your efforts.
For now, that's all I've got.

Mon, 2014-10-27 07:37
sidio47