I’ve spent some time off and on over the last year or so writing various versions of web crawlers to pull different information off the web. Some of it was for a potential business idea, some of it just to learn a few things. One thing I had a hard time figuring out was how to deal with threading. I had a list of URLs that I wanted to crawl, but there were specific things I wanted to do with each one, and various counters I was incrementing along the way. Plus, I’ve found that threading and I don’t jibe that well. Maybe I’m just not smart enough for it, who knows.
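Just to illustrate the counter headache (this is a made-up example, not my crawler code): a plain ++ on a shared counter is really three steps, read, add, write, so two threads can interleave and silently lose increments. Interlocked.Increment does the whole thing atomically:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class CounterDemo
{
    static int _unsafeCount;  // bumped with ++, which is not atomic
    static int _safeCount;    // bumped with Interlocked.Increment

    static void Main()
    {
        Parallel.For(0, 1000000, i =>
        {
            _unsafeCount++;                         // racy: increments can be lost
            Interlocked.Increment(ref _safeCount);  // atomic: never loses one
        });

        Console.WriteLine("unsafe: {0}", _unsafeCount); // usually comes up short of 1000000
        Console.WriteLine("safe:   {0}", _safeCount);   // always 1000000
    }
}
```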
As I was doing my research/learning/reading about C#, I ran across the excellent Parallel Processing blog from MSDN. I was fascinated by the Microsoft Biology Foundation and how they were using the parallelism support in .NET 4. The blog is a good read in general; those guys are a bit too smart for me to keep up with, but it’s fascinating nonetheless.
I’ll let the smart guys at that blog explain it better than I can, but in short, the parallel constructs in .NET 4 spread your work across however many CPU cores you have available. It’s important to note that you won’t gain from this technique if a single outside resource is what’s slowing down your processing; if everything funnels through one slow disk or one rate-limited server, extra threads just end up waiting in line. But in my case, I was going out to a website and pulling information from different pages, and each request spends most of its time waiting on the network, so running several at once overlaps that waiting. Parallel processing let me do this much faster than a regular foreach loop. Good stuff.
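Here’s a rough sketch of the kind of loop I mean. The URLs, counter name, and parsing stub are placeholders, not my actual crawler code; the point is just that swapping foreach for Parallel.ForEach lets several downloads run at once:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

class CrawlerSketch
{
    static void Main()
    {
        // Placeholder URLs; substitute your own crawl list.
        var urls = new List<string>
        {
            "http://example.com/page1",
            "http://example.com/page2",
            "http://example.com/page3",
        };

        int pagesFetched = 0;

        // Parallel.ForEach partitions the list across worker threads,
        // so several downloads can be in flight at once instead of one at a time.
        Parallel.ForEach(urls, url =>
        {
            // WebClient instances aren't thread-safe, so each iteration gets its own.
            using (var client = new WebClient())
            {
                try
                {
                    string html = client.DownloadString(url);
                    // ... do your per-page parsing here ...

                    // Shared counter: go through Interlocked rather than ++ to stay thread-safe.
                    Interlocked.Increment(ref pagesFetched);
                }
                catch (WebException ex)
                {
                    Console.WriteLine("Failed {0}: {1}", url, ex.Message);
                }
            }
        });

        Console.WriteLine("Fetched {0} of {1} pages.", pagesFetched, urls.Count);
    }
}
```

One thing to watch: pointing a bunch of parallel requests at a single site can get you throttled, which is exactly the kind of outside-resource bottleneck I mentioned above.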