I've been running my web scraping service (obzrv.it) for quite some time now. I've had a lot of fun working on it,
sharpened my Go skills and learned a lot about web scraping, monitoring and automation.
Sadly, I found it super difficult to attract any paying customers. I was spending $70 per month on an AWS EC2 instance. The thing with
web scraping is that, to make it really work, you need to open web pages in a real browser, typically Chrome because it offers nice
control and debugging features. And that's not cheap 😢 Chrome needs resources, mostly RAM: 4 GB is the minimum.
This meant I needed a machine with 8 GB of RAM and ideally 4 cores, so that more complex pages render quickly. Speed also matters,
because the service opens multiple pages, multiple times a day. That cost money, and
I had to shut the service down after around a year to avoid further losses.
Meanwhile, while working on another project, I was playing with a technique called port forwarding. In short, it lets you expose your local computer to the Internet, so that your home PC accepts incoming connections just like a real server machine from a hosting provider. You run a small SSH command on your computer (SSH, or Secure SHell, is a way to securely connect to a remote machine and a standard tool for every developer). The command connects to a remote server and tells it to forward all traffic arriving on a specific port to a port on your local machine. Technically, that remote computer is your "website", or rather a "gateway", but it doesn't really handle users' requests. Instead, it forwards all incoming traffic to your local machine, as if the user had connected to your home PC. The big difference in my case: because just forwarding traffic needs very little CPU or RAM, a small and cheap remote server is enough, and I could run my web scraping service again, very cheaply 😊
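To make the idea concrete, here is a minimal sketch of what such an SSH command could look like. The gateway address and both port numbers are placeholders I made up for illustration, not my real setup. The script only prints the command, so it is safe to run as-is. One assumption to be aware of: with a default OpenSSH server, a forwarded port is bound to the server's loopback interface only, so the `GatewayPorts` option in `sshd_config` would probably need to be enabled before outside visitors can reach it.

```shell
#!/bin/sh
# All values below are placeholders for illustration only.
GATEWAY="user@gateway.example.com"   # hypothetical SSH login on the cheap remote server
REMOTE_PORT=8080                     # port the gateway listens on
LOCAL_PORT=8080                      # port the service listens on at home

# Build the reverse-tunnel command:
#   -N  do not run a remote command, only forward ports
#   -R  listen on REMOTE_PORT on the gateway and forward each connection
#       to localhost:LOCAL_PORT on the machine that started ssh
tunnel_cmd() {
  printf 'ssh -N -R %s:localhost:%s %s\n' "$REMOTE_PORT" "$LOCAL_PORT" "$GATEWAY"
}

# Print the command instead of executing it.
tunnel_cmd
```

Running the printed command on the home machine keeps the tunnel open for as long as the SSH session lives.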
As described in my previous blog post, I'm running my web scraping service on a Raspberry Pi at home. Thanks to the port forwarding trick, my Raspberry acts as a real server machine, and I can run my service on it for a fraction of the cost of a real server. But as with every server, I also need to access it remotely to install updates, check logs, or just see if it's still running. And because it runs a web scraping service, I need to see the browser window to check that pages are rendered correctly. This means I cannot simply connect to it via SSH: I need to see the desktop, and I need to be able to access it from anywhere in the world.
My Raspberry runs Ubuntu Linux, so I decided to use VNC to connect to it. VNC is remote desktop software that lets you see the desktop of a remote machine and interact with it as if you were sitting in front of it. I could already connect to my Pi via VNC locally, just by using its local IP address, but I wanted to connect to it from anywhere in the world.
Here, too, port forwarding comes to the rescue. Just like my web scraping service running on the same device, VNC is, technically, just another server that listens on a specific port and waits for incoming connections. While web servers typically listen on port 80 or 8080, VNC servers listen on port 5900 by default. So I just had to run another SSH command on my Raspberry, which connects to my remote AWS host and tells it to forward all traffic from its port 5900 to port 5900 on the Raspberry Pi.
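As a rough sketch, the second tunnel could look like the script below. The gateway address is a made-up placeholder, and the two `-o` options are my own additions for robustness rather than something this setup strictly requires. As before, the script only prints the command instead of running it.

```shell
#!/bin/sh
# Placeholder values for illustration only.
GATEWAY="user@gateway.example.com"   # hypothetical SSH login on the remote AWS host
VNC_PORT=5900                        # default VNC port, same on both ends

# Build the reverse-tunnel command for VNC:
#   -N                              only forward ports, run no remote command
#   -o ExitOnForwardFailure=yes     exit if the remote port cannot be bound
#   -o ServerAliveInterval=60       send a keepalive probe every 60 seconds
#   -R 5900:localhost:5900          gateway port 5900 -> port 5900 on the Pi
vnc_tunnel_cmd() {
  printf 'ssh -N -o ExitOnForwardFailure=yes -o ServerAliveInterval=60 -R %s:localhost:%s %s\n' \
    "$VNC_PORT" "$VNC_PORT" "$GATEWAY"
}

# Print the command instead of executing it.
vnc_tunnel_cmd
```

With the tunnel running on the Raspberry, a VNC client pointed at port 5900 of the gateway host should end up on the Pi's desktop, assuming the gateway accepts outside connections on that port.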