ruby Net::SFTP put / upload file hang
There are quite a few posts around regarding problems with upload of files using the ruby Net::SFTP library written by Jamis Buck. You can see a smattering by googling for terms like ruby net::sftp upload hang. Several of them are in the context of problems with uploads under Capistrano as well as a couple of blog posts about potential workarounds, such as this one.
I decided to spend some time with my head in the Netbeans ruby debugger to see if I could get a glimpse more at the root of the cause. I didn’t set out to solve this or anything… because I figure if the author of the libraries himself hasn’t figured it out (or been able to make it consistently repro) then I probably don’t have much hope doing so. Not only that, but I can’t spend a week learning all the underlying stuffs to attempt to fix this bug in our production application. So a work around we shall have.
I did, however, make some interesting observations which may or may not be of use to someone else looking into the problem. First, some repro oddities, in no particular order:
- I have been unable to repro in a primitive case with a toy script to simply upload a file.
- I can consistently repro with “larger” files; works like a charm for smaller files. Unfortunately, no pattern consistently around CHUNK_SIZE seems to be evident. (See next item.)
- I can make it work everytime if I reduce CHUNK_SIZE in your-net-sftp-code-dir/lib/net/sftp/operations/write.rb from 32 * 1024 to “something smaller.” I’ve made it work with 1024, 4096, 16384, etc. I mucked with this value after finding this thread on cpanforum about a potentially similar sounding issue with Perl’s Net::SFTP library.
- The hosts in question don’t seem to matter. Sometimes remote, sometimes more proximal, sometimes localhost.
So here’s some code to look at (if I can make this fucking wordpress editor not destroy it). This is from your-net-sftp-code-dir/lib/net/sftp/operations/abstract.rb:
# Execute the operation. If a callback is given, the operation will be
# performed asynchronously with the callback being invoked when the
# operation completes. If a callback is not given, the operation will
# be performed synchronously, with the expected value being returned.
def execute( *args, &callback )
@log.debug "executing" if @log.debug?
.
unless block_given?
status = result = nil
callback = Proc.new do |status, *pargs|
status = status
result = case
when pargs.empty? then @session.status
when pargs.length == 1 then pargs.first
else pargs
end
end
end
.
@callback = callback
@id = perform *args
@log.debug "received request id #{@id}"
@session.register( @id, self )
.
unless block_given?
i = 0
@session.loop { status.nil? }
@session.status = status
return result
end
end
What I’ve discovered via the debugger is that, in a case where the problem repros, the callback is never invoked and it hangs inside of @session.loop(). The callback should actually be invoked inside of the do_status() function of the operation being performed. In this case, from write.rb:
# Invoked when the server sends a status packet. If the status is FX_OK,
# then the callback is invoked (if all data has been written), or the
# next chunk is written to the server (if more data remains). Other
# status codes are handled by the superclass.
def do_status( code, message, language )
if code == FX_OK
@log.debug "[#{@id}] chunk written” if @log.debug?
@pos += CHUNK_SIZE
.
if @pos > @data.length
@callback[ OK ]
return
end
.
@driver.write @id, @handle, @offset + @pos, @data[@pos,CHUNK_SIZE]
@session.register @id, self
else
super
end
end
I’m guessing that the FX_OK status packet is indeed making it back, but something is happening in the libraries such that the callback from the lower layer is not being invoked. When the callback isn’t invoked, the status is always nil and so we sleep in the @session.loop(). That’s our hang.
Ok… now someone fix it.
Cool, great analysis! This could be like a chain in the Ruby community on whose end someone will fix the bug:)
Did you let Jamis know? If you can reproduce it, it might help him a lot to fix it..
Btw what was your workaround?