Archive for the ‘geek’ Category

Almost another year!

I’d forgotten this blog.

Long story short: my Linux box is stable. It ended up being two dodgy sticks of RAM. Stable for the last… 8 months or so now.

In gaming-land Minecraft has stolen my time. Between it and my occasional logging into WoW, I haven’t played anything else in months. Not since I finished Portal 2 within a week of its release.

I’m 5 months into the process of migrating from one VPS to another. I got a Linode Xen-based VM in December on a special deal… jagged a $100 credit on my account, which is only just about to expire. My “new” VPS is Debian 5 based, rather than the CentOS one I had no option with on the “current” VPS.

I’ve been very impressed with Linode. 100% uptime on my VM until a few days ago, when their Fremont, California datacentre was affected by power issues from their utilities provider. My VM was only down for a few hours all up, not an issue for me. Also lots of cool stuff in their management dashboard. You can clone/stop/start/modify VMs with ease. Not to mention the virtual KVM solution you can SSH to, very cool if you forget to set your default gateway etc.

One day I’ll get around to migrating DNS and secondary MX duties to the new VM, and shutting down the old one. That is… unless I migrate my email over to Google Apps and do away with running my own mail server. So far my tin foil hat tendencies have stopped me doing this.

The Neverending Story

It’s been a few months since I last posted saying I’d fixed my crashing Linux box issue. As it turns out, I hadn’t fixed it.

My server was stable for a very long time… then started having issues again. After everything I’d done, I went out and bought a better motherboard and installed it. Seemed good…. a week later issues crept in again.

Then I noticed the kernel modules installed for Virtualbox were for the Open Source Edition, not the version I was running. So I nuked them and installed the correct ones. All good for another week.

And the issues came back.

So I ran memtest86 for a few hours on the RAM (I did this months ago too), with no issues. Put the RAM back in, ran the Linux box till it had issues again. I immediately rebooted into memtest86… and got errors.

I’ve removed the two 2gig DIMMs that I had been using, and dropped in a single 2gig DIMM I yanked from my HTPC. It’s been about a month since then, and it’s been OK. Problem is I’ve gone 3 months without issues before… so unless this thing is rock solid until Christmas, I’m not counting it as fixed. I’ve learnt my lesson.

In other news, both of the NICs I have in the server have issues under Linux. The first uses a Realtek 8169 chipset… which has no drivers available that I can find. So I can’t use that one. The other uses the sky2 module. Googling turns up lots of users with cards using this module who have the exact same issue I do. Specifically, under very high throughput load I get “NETDEV WATCHDOG: eth0: transmit timed out”. Further googling found me a dodgy script someone had put together that may or may not resolve the issue. It basically downs eth0, removes the module, reinserts the module and ups the interface again. I haven’t had time to force the problem to occur so I can test if the script actually restores network traffic or not.
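For the curious, the down / unload / reload / up sequence looks roughly like this in Python. This is not the script I found, just a minimal sketch assuming the interface is eth0, the module is sky2, and that `ip` and `modprobe` are on the box and it runs as root. The function names are made up for illustration.

```python
import subprocess

WATCHDOG_MARKER = "NETDEV WATCHDOG"


def has_watchdog_timeout(log_lines, iface="eth0"):
    """Return True if any kernel log line reports a transmit timeout for iface."""
    return any(
        WATCHDOG_MARKER in line and iface in line and "transmit timed out" in line
        for line in log_lines
    )


def reset_commands(iface="eth0", module="sky2"):
    """Build the down / unload / reload / up sequence as argument lists."""
    return [
        ["ip", "link", "set", iface, "down"],
        ["modprobe", "-r", module],
        ["modprobe", module],
        ["ip", "link", "set", iface, "up"],
    ]


def reset_nic(iface="eth0", module="sky2", dry_run=True):
    """Run the sequence in order; with dry_run=True only return the commands."""
    commands = reset_commands(iface, module)
    if not dry_run:
        for cmd in commands:
            # Each step must succeed before the next one runs (needs root).
            subprocess.run(cmd, check=True)
    return commands
```

With the default `dry_run=True` nothing is executed, so it is safe to call just to see the commands. Bringing the link up this way may not restore addressing or routes; on Debian, `ifdown`/`ifup` might be the better fit for the first and last steps.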

There you have it. As it stands I’m pretty happy with the server… it’s been stable for a while now, and it’s doing everything I want it to do. Which is good, as I’m busy enough with random faults at work. I hate dealing with them in my own time.

Yowsers

A quick recap on the issues I’ve encountered with my “new” Linux box since I built it late last year.

1. PC wouldn’t POST. Turns out the cheap arse motherboard I bought was overclocking my RAM, which didn’t like it much. Fixed by hard setting RAM speed in the BIOS.

2. Writing files across Samba shares resulted in the data written giving CRC errors when read back. In the end I found this was caused by the CPU overheating and getting to 100 degrees Celsius. Fixed by replacing the heatsink and fan with a much better aftermarket model, though I suspect the stock Intel one had been faulty.

3. Memory errors in syslog, corresponding to high network traffic. Found that the NIC on the motherboard is incorrectly detected and used by recent Debian/Ubuntu releases. Fix was to install the correct drivers (posted about this earlier). The errors pop up very rarely now, so I’m not sure this is 100% resolved, but it is much better.

4. Hard freezes. This is the last one I’m working on. At first I believed VMware was the cause, so I migrated to Virtualbox. I also tried removing one of the two DIMMs, and thought that fixed it, only to have the issue pop up weeks later. So today I’ve put the second DIMM back in, as I haven’t had a hard crash since I removed the old SATA controller.

5. XP VM CPU getting stuck at 50-100% usage. Seems to be a Virtualbox issue, and I haven’t been able to resolve it fully. I’ve disabled all hardware VT knobs for the VM itself, which seems to have helped. At least it hasn’t reoccurred in quite a while.

6. Linux VM processes segfaulting. Specifically the cacti poller on my oldest VM. This was originally run under Windows in VMware. I suspect this is a software issue with the VM itself, rather than with its environment. The machine is overdue for a complete rebuild (it is running Debian 4), so this will probably fix itself when I rebuild the machine.

Starcruiser crash! again

I think I’m getting there, slowly but surely. After my last post, the Virtualbox host hung when my XP VM was under very heavy CPU and IO load. It was at this point I realised I still had the old 4 port SATA controller in the box, even though it was my prime suspect for the issues that led me to build this box to start with.

I’ve now removed the old SATA controller, and given the XP VM a good CPU/IO thrashing. No crash.

These issues have been dragging on since October/November last year. I must be getting to the end of them by now, surely. Too bad if I’m not, I guess. I can’t bring myself to migrate my personal mail from a VM to Google, nor to host it anywhere else outside of my control. I either have to get this working in a stable, reliable manner, or I need to take off my tinfoil hat and stop micro managing email.

I’m aiming for the former, but as these issues keep popping up I’m starting to think the latter is more likely.

Here we go again!

So it turns out enabling the ioapic option for my Linux VM running within Virtualbox was not the solution to all of my VM woes. Since December I’ve been through several other “solutions”. Each time I think I’ve found the root cause of the crashing and resolved it. This time, I’m just hoping that I am one step closer to having this fixed.

What have I done this time? I’ve set the kernel to use the correct driver for the network card in the box. It turns out that the NIC I have is a Realtek 8168… but modprobe decided to use a Realtek 8169 driver instead. Apparently this causes all sorts of issues…. possibly the ones I’ve been seeing.

Anyway some smart cookie has written up a nice script to remove the wrong module, compile a new one, and even nicely make the change persistent through reboots. This can be found here.
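If you want to check which driver your own NIC ended up with, sysfs will tell you. Here is a minimal Python sketch, assuming the usual `/sys/class/net/<iface>/device/driver` symlink is present. The function names are my own, and the blacklist line is only my guess at one way to stop the wrong module loading on boot, not necessarily what the script above does.

```python
import os


def bound_driver(iface="eth0", sysfs_root="/sys/class/net"):
    """Return the name of the kernel driver bound to iface, or None."""
    link = os.path.join(sysfs_root, iface, "device", "driver")
    if not os.path.islink(link):
        # Virtual or missing interfaces have no device/driver symlink.
        return None
    # The symlink target ends in the driver name, e.g. .../drivers/r8169
    return os.path.basename(os.readlink(link))


def blacklist_line(module="r8169"):
    """Line that could go in a file under /etc/modprobe.d/ (assumption)."""
    return f"blacklist {module}\n"
```

On my box, a result of `r8169` from `bound_driver("eth0")` for an 8168 card would be the mismatch described above.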

This box has previously been perfectly stable for a month… and then crashed repeatedly for a few days until I “fixed” it again. I’ve given the network card a good thrashing already, so if it’s stable for the next two months… I’ll declare victory. Until then, fingers crossed.

edit: Forgot to mention that previously when copying to/from SMB shares, I’d get kernel memory errors… which I’d “fixed” by increasing the figure in /proc/sys/vm/min_free_kbytes from its default up to 32000. It didn’t completely fix the issue, but it did partially alleviate it. Also I’m running Debian 5 AMD64, but according to the site I linked earlier, this issue is present in most recent Debian and Ubuntu releases.
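The setting in question is the vm.min_free_kbytes sysctl, which the kernel exposes as the file /proc/sys/vm/min_free_kbytes. A rough Python sketch of reading and changing it follows; the helper names are hypothetical, writing needs root, and 32000 is simply the value I used, not a recommendation.

```python
MIN_FREE_PATH = "/proc/sys/vm/min_free_kbytes"


def read_min_free_kbytes(path=MIN_FREE_PATH):
    """Return the current reserve, in kilobytes, as an int."""
    with open(path) as f:
        return int(f.read().strip())


def write_min_free_kbytes(value, path=MIN_FREE_PATH):
    """Set a new reserve in kilobytes; rejects non-positive values."""
    if value <= 0:
        raise ValueError("min_free_kbytes must be a positive integer")
    with open(path, "w") as f:
        f.write(f"{value}\n")
```

A value written this way is lost at reboot. To keep it, a line such as `vm.min_free_kbytes = 32000` in /etc/sysctl.conf is the usual approach.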
