ruby Net::SFTP put / upload file hang
There are quite a few posts around regarding problems with upload of files using the ruby Net::SFTP library written by Jamis Buck. You can see a smattering by googling for terms like ruby net::sftp upload hang. Several of them are in the context of problems with uploads under Capistrano as well as a couple of blog posts about potential workarounds, such as this one.
I decided to spend some time with my head in the Netbeans ruby debugger to see if I could get a glimpse more at the root of the cause. I didn’t set out to solve this or anything… because I figure if the author of the libraries himself hasn’t figured it out (or been able to make it consistently repro) then I probably don’t have much hope doing so. Not only that, but I can’t spend a week learning all the underlying stuffs to attempt to fix this bug in our production application. So a work around we shall have.
I did, however, make some interesting observations which may or may not be of use to someone else looking into the problem. First, some repro oddities, in no particular order:
- I have been unable to repro in a primitive case with a toy script to simply upload a file.
- I can consistently repro with “larger” files; works like a charm for smaller files. Unfortunately, no pattern consistently around CHUNK_SIZE seems to be evident. (See next item.)
- I can make it work everytime if I reduce CHUNK_SIZE in your-net-sftp-code-dir/lib/net/sftp/operations/write.rb from 32 * 1024 to “something smaller.” I’ve made it work with 1024, 4096, 16384, etc. I mucked with this value after finding this thread on cpanforum about a potentially similar sounding issue with Perl’s Net::SFTP library.
- The hosts in question don’t seem to matter. Sometimes remote, sometimes more proximal, sometimes localhost.
So here’s some code to look at (if I can make this fucking wordpress editor not destroy it). This is from your-net-sftp-code-dir/lib/net/sftp/operations/abstract.rb:
# Execute the operation. If a callback is given, the operation will be
# performed asynchronously with the callback being invoked when the
# operation completes. If a callback is not given, the operation will
# be performed synchronously, with the expected value being returned.
def execute( *args, &callback )
@log.debug "executing" if @log.debug?
.
unless block_given?
status = result = nil
callback = Proc.new do |status, *pargs|
status = status
result = case
when pargs.empty? then @session.status
when pargs.length == 1 then pargs.first
else pargs
end
end
end
.
@callback = callback
@id = perform *args
@log.debug "received request id #{@id}"
@session.register( @id, self )
.
unless block_given?
i = 0
@session.loop { status.nil? }
@session.status = status
return result
end
end
What I’ve discovered via the debugger is that, in a case where the problem repros, the callback is never invoked and it hangs inside of @session.loop(). The callback should actually be invoked inside of the do_status() function of the operation being performed. In this case, from write.rb:
# Invoked when the server sends a status packet. If the status is FX_OK,
# then the callback is invoked (if all data has been written), or the
# next chunk is written to the server (if more data remains). Other
# status codes are handled by the superclass.
def do_status( code, message, language )
if code == FX_OK
@log.debug "[#{@id}] chunk written” if @log.debug?
@pos += CHUNK_SIZE
.
if @pos > @data.length
@callback[ OK ]
return
end
.
@driver.write @id, @handle, @offset + @pos, @data[@pos,CHUNK_SIZE]
@session.register @id, self
else
super
end
end
I’m guessing that the FX_OK status packet is indeed making it back, but something is happening in the libraries such that the callback from the lower layer is not being invoked. When the callback isn’t invoked, the status is always nil and so we sleep in the @session.loop(). That’s our hang.
Ok… now someone fix it.
Login incorrect; valid user, no password prompt
I recently released a linux machine using the Samba/Winbind combo from the grips of an active directory server - a product I absolutely detest for a myriad of reasons which I won’t go into here.
So… in my haste, I forgot to do something and ended up with the following symptoms on a tty login:
Debian GNU/Linux lenny/sid slard tty1
slard login:
So I try to login…
Debian GNU/Linux lenny/sid slard tty1
slard login: root
LOGIN INCORRECT
Eh? What gives?
Turns out I forgot to fix my pam configuration under /etc/pam.d/common-password.
The solution? Assuming it’s your PAM configuration, boot into single user mode* and make sure your PAM configuration isn’t borked. I merely switched back to the default config:
password required pam_unix.so nullok obscure min=4 max=8 md5
*Not sure how to boot into single user mode? If you’re using grub:
Debian automatically adds a line for each kernel installed via debs that includes the normal kernel boot as well as a single user. If you don’t have that for whatever reason, just do it manually. At the grub menu, select the kernel you want to boot into and choose ‘e’ to edit the config. On the ‘kernel’ line, simply add the token ’single’. Then press ‘b’ to boot. You’ll boot into single-user mode; run level 1.