Abnormal termination of running CUDA program

Hello, I am running the cricket project in Nanos to use remote GPU resources from a unikernel, but the program crashes with a fatal error that I don't know how to resolve. My detailed build process is below.

  1. Compile cricket

    git clone https://github.com/RWTH-ACS/cricket.git
    cd cricket && git submodule update --init
    LOG=INFO make
    

    The build output is placed in the bin directory.

  2. Build a Nanos image
    The project is structured as follows:

    .
    ├── bin
    │   ├── cricket-client.so
    │   ├── cricket-rpc-server
    │   ├── cricket-server.so
    │   ├── libtirpc.so
    │   ├── libtirpc.so.3
    │   └── tests
    │       ├── api.testapp
    │       ├── bandwidthTest.sample
    │       ├── cpu.testapp
    │       ├── cricket.testapp
    │       ├── kernel.testapp
    │       ├── matrixMul.compressed.sample
    │       ├── matrixMul.uncompressed.sample
    │       ├── mnistCUDNN.sample
    │       ├── nbody.compressed.sample
    │       ├── nbody.uncompressed.sample
    │       ├── test_list.test
    │       └── test_resource_mg.test
    ├── config.json
    ├── etc
    │   └── netconfig
    ├── lib
    │   └── x86_64-linux-gnu
    │       ├── cricket-client.so
    │       ├── libcudart.so.12
    │       ├── libdl.so.2
    │       ├── libelf.so.1
    │       ├── librt.so.1
    │       ├── libtirpc.so.3
    │       └── libz.so.1
    ├── proc
    │   └── self
    │       └── comm
    └── start.sh
    

    config.json

    {
        "MapDirs":{
            "./etc/*":"/etc",
            "./lib/*":"/usr/lib",
            "./proc/*":"/proc"
        },
        "Env":{
            "REMOTE_GPU_ADDRESS":"192.168.1.63",
            "LD_PRELOAD":"/usr/lib/x86_64-linux-gnu/cricket-client.so"
        }
    }
    

    start.sh

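    # Name of the test binary under bin/tests
    appName='cricket.testapp'
    # Populate the fake /proc/self/comm that config.json maps into the image
    echo "$appName" > ./proc/self/comm
    ops run "bin/tests/$appName" -c config.json -b -t tap0 --ip-address 192.168.1.166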
    
  3. Run the program
    Run the cricket server on a machine with an NVIDIA GPU and a CUDA environment:

    ./bin/cricket-rpc-server
    
    welcome to cricket!
    +08:00:00.000004 INFO:  using TCP...
    +08:00:00.056793 INFO:  listening on port 64172
    +08:00:00.323820 INFO:  waiting for RPC requests...
    

    Run the Nanos instance:

    ./start.sh
    

    The instance then crashed with the following error:

    running local instance
    booting /root/.ops/images/cricket ...
    en1: assigned 192.168.1.166
    +00:00:00.000389 INFO:  connection to host "192.168.1.63"
    +00:00:00.011436 INFO:  connecting via TCP...
    en1: assigned FE80::64AC:D2FF:FE0E:B2EE
    
    *** signal 11 received by tid 2, errno 0, code 1
        fault address 0x0
    
    *** Thread context:
    lastvector: 000000000000000e (Page fault)
         frame: ffffc00002a02000
          type: thread
    active_cpu: 00000000ffffffff
     stack top: 0000000000000000
    error code: 0000000000000004
       address: 0000000000000000
    
       rax: 0000000000000000
       rbx: 0000000000676c40
       rcx: 0000000000000000
       rdx: 0000000000000001
       rsi: 00000000000026d0
       rdi: 0000000000674570
       rbp: 0000000000000001
       rsp: 00000000ffd7e9b0
        r8: 0000000000000000
        r9: 0000000000000001
       r10: fffffffffffff8fa
       r11: 0000008c21f68c70
       r12: 00000000000026d0
       r13: 0000000000000000
       r14: 00000000ffd7eda8
       r15: 0000000000679cd8
       rip: 0000008c21f68c95
    rflags: 0000000000010202
        ss: 000000000000002b
        cs: 0000000000000023
        ds: 0000000000000000
        es: 0000000000000000
    fsbase: 0000000100b61000
    gsbase: 0000000000000000
    
    frame trace:
    
    loaded klibs: 
    
    stack trace:
    00000000ffd7e9b0:   0000000000679cd8
    00000000ffd7e9b8:   0000000000676c40
    00000000ffd7e9c0:   0000000000674570
    00000000ffd7e9c8:   00000000000026d0
    00000000ffd7e9d0:   0000000000000000
    00000000ffd7e9d8:   00000000ffd7eda8
    00000000ffd7e9e0:   0000000000679cd8
    00000000ffd7e9e8:   000000f117489f50
    00000000ffd7e9f0:   00000000ffd7ea54
    00000000ffd7e9f8:   00000039e87aee56
    00000000ffd7ea00:   1c000080e87bbf60
    00000000ffd7ea08:   1dcd29f6f5a0d200
    00000000ffd7ea10:   000000f1174f4c20
    00000000ffd7ea18:   1dcd29f6f5a0d200
    00000000ffd7ea20:   00000000ffd7eb50
    00000000ffd7ea28:   00000039e87adba9
    00000000ffd7ea30:   0000000000000000
    00000000ffd7ea38:   1dcd29f6f5a0d200
    00000000ffd7ea40:   00000000ffd7eb48
    00000000ffd7ea48:   00000039e87ae387
    00000000ffd7ea50:   0000000000000000
    00000000ffd7ea58:   0000000000000000
    00000000ffd7ea60:   00000000ffd7eb58
    00000000ffd7ea68:   00000039e87adb19
    00000000ffd7ea70:   0000000000000000
    00000000ffd7ea78:   000000000097c8c0
    00000000ffd7ea80:   000000000097c928
    00000000ffd7ea88:   00000039e87bbf60
    00000000ffd7ea90:   00000000ffd7eb40
    00000000ffd7ea98:   00000039e87adab0
    00000000ffd7eaa0:   00000000ffd7eb30
    00000000ffd7eaa8:   00000039e87a18c9
    
       core dump
    

Thanks for the detailed info. Can you run ops with '--trace'? Just before the crash you'll see some output that points at what it is failing on.
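
One way to sift through the trace, as a rough sketch: save the output to a file and list the paths that Nanos reports as missing. This assumes failed lookups are printed in the form "<path>" - not found; the function name missing_paths and the file name trace.log are made up for illustration.

```shell
# Print the unique paths that a saved trace log reports as missing.
# Assumption: failed lookups appear as lines like
#   2 "/some/path" - not found
missing_paths() {
    grep -- '- not found' "$1" \
        | sed 's/^[^"]*"\([^"]*\)".*$/\1/' \
        | sort -u
}

# Hypothetical usage, after adding --trace to the ops run line in start.sh:
#   ./start.sh 2>&1 | tee trace.log
#   missing_paths trace.log
```

A missing path that shows up right before the signal is a good first suspect.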

    2 direct return: 0, rsp 0xffeec990
    2 rt_sigprocmask
    2 direct return: 0, rsp 0xffeec990
    2 getsockname
    2 direct return: 0, rsp 0xffeecc08
    2 rt_sigaction
    2 direct return: 0, rsp 0xffeeca60
    2 rt_sigprocmask
    2 direct return: 0, rsp 0xffeeca10
    2 write
    2 direct return: 56, rsp 0xffeeca58
    2 poll
    2 direct return: 1, rsp 0xffeec978
    2 read
    2 direct return: 32, rsp 0xffeec978
    2 rt_sigprocmask
    2 direct return: 0, rsp 0xffeeca10
    2 openat
    2 "/tmp/cricket-elf-dump" - not found
    2 direct return: -2, rsp 0xffeec8b0
    2 thread_attempt_interrupt: tid 2
    2    uninterruptible or already running
    2 signal 11 received, errno 0, code 1
    2    fault address 0x0
    2    default action

    *** signal 11 received by tid 2, errno 0, code 1
        fault address 0x0

    *** Thread context:
    lastvector: 000000000000000e (Page fault)
         frame: ffffc00002a02000
          type: thread
    active_cpu: 00000000ffffffff
     stack top: 0000000000000000
    error code: 0000000000000004
       address: 0000000000000000

It looks like the /tmp/cricket-elf-dump file is missing: the trace shows the openat call on it returning -2 (ENOENT) right before the segfault. When I run the same program outside Nanos, this file is generated while the ELF binary runs, but it is not created inside Nanos.

I created an empty tmp directory and mapped it into the image. The program now runs successfully. Thank you very much.
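
For anyone hitting the same crash, the fix might look roughly like this. The thread does not show the exact config change, so the ./tmp/* entry below is an assumption that mirrors the existing MapDirs entries; the rest of config.json is copied from the original post.

```shell
# Run from the project root (the directory that contains config.json).
# Create the empty directory that will back /tmp inside the image.
mkdir -p tmp

# Rewrite config.json with one extra MapDirs entry for /tmp.
# Assumption: the "./tmp/*" mapping is a guess at what the poster added.
cat > config.json <<'EOF'
{
    "MapDirs":{
        "./etc/*":"/etc",
        "./lib/*":"/usr/lib",
        "./proc/*":"/proc",
        "./tmp/*":"/tmp"
    },
    "Env":{
        "REMOTE_GPU_ADDRESS":"192.168.1.63",
        "LD_PRELOAD":"/usr/lib/x86_64-linux-gnu/cricket-client.so"
    }
}
EOF
```

If ops does not pick up an empty directory through the glob, the Dirs config key may be worth trying instead; that variant is untested here.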
