I had the chance to investigate how we could automate downloads from a couple of websites. The current process is excruciatingly manual, and ripe for errors (as all manual processes are).
I first went to check out the websites to see what we were dealing with here. Is there an API that could be used to pull the files down? Is there some sort of service that they could provide to push the files to us? No. And no, of course.
So no API. No clean way to do it. I’ll just have to log in programmatically and download what I need. Of course, the two sites I was accessing had completely different implementations.
The first one was pretty easy. It just uses basic authentication and then allows you to proceed. Why a public-facing web application uses basic authentication in 2015 I don’t know, but I guess that’s another conversation.
Here’s how I implemented it. I also needed to download the file itself by sending a POST to a particular URL, and to save it to a specific location, so that’s included as well.
Uri uri = new Uri(_authUrl);

var credentialCache = new CredentialCache();
credentialCache.Add(
    new Uri(uri.GetLeftPart(UriPartial.Authority)), // request url's host
    "Basic",                                        // authentication type. hopefully they don't change it.
    new NetworkCredential(_uname, _pword)           // credentials
);

using (WebClient client = new WebClient())
{
    client.UseDefaultCredentials = true;
    client.Credentials = credentialCache;

    // This is the stuff that the form on the page expects to see.
    // Pulled from the HTML source and javascript function.
    System.Collections.Specialized.NameValueCollection formParams =
        new System.Collections.Specialized.NameValueCollection();
    formParams.Add("param1", "value1");
    formParams.Add("param2", "value2");
    formParams.Add("param3", "value3");
    formParams.Add("filename", _downloadFileName);

    byte[] responsebytes = client.UploadValues(_urlForDownload, "POST", formParams);

    // Write the file somewhere. NOTE: the location must exist, so create it if it doesn't.
    // Worth revisiting when implementing exception handling.
    if (!Directory.Exists(_fileDownloadLocation))
        Directory.CreateDirectory(_fileDownloadLocation);

    File.WriteAllBytes(string.Format(@"{0}\{1}", _fileDownloadLocation, _downloadFileName), responsebytes);
}
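As an aside, Basic authentication is nothing more than a base64-encoded "username:password" pair in the Authorization header, so a site like this can sometimes be satisfied by setting that header yourself. Here's a minimal sketch of that alternative (assuming this particular site accepts the header on the download POST, which I didn't verify):

using (WebClient client = new WebClient())
{
    // Build the Basic Authorization header by hand instead of using a CredentialCache.
    string token = Convert.ToBase64String(System.Text.Encoding.ASCII.GetBytes(_uname + ":" + _pword));
    client.Headers[HttpRequestHeader.Authorization] = "Basic " + token;

    // The form parameters and the file-writing code would be the same as above.
    byte[] responsebytes = client.UploadValues(_urlForDownload, "POST", formParams);
}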
The other website used Forms Authentication in its implementation. While this was a welcome difference (since, again, it’s 2015), it did make things a little more difficult.
I couldn’t just use C#’s WebClient again because it doesn’t deal with cookies. And most applications on the internet use sessions, cookies, and other such hackery to keep track of you and make sure that you’re really logged in and are who you say you are.
I found an implementation of what seems to be called a “cookie-aware WebClient.” I don’t recall which site I got it from, but many implement it in a very similar way. Here is the code for a class called WebClientEx. It simply extends WebClient:
public class WebClientEx : WebClient
{
    public WebClientEx(CookieContainer container)
    {
        this.container = container;
    }

    private readonly CookieContainer container = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest r = base.GetWebRequest(address);
        var request = r as HttpWebRequest;
        if (request != null)
        {
            request.CookieContainer = container;
        }
        return r;
    }

    protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
    {
        WebResponse response = base.GetWebResponse(request, result);
        ReadCookies(response);
        return response;
    }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        WebResponse response = base.GetWebResponse(request);
        ReadCookies(response);
        return response;
    }

    private void ReadCookies(WebResponse r)
    {
        var response = r as HttpWebResponse;
        if (response != null)
        {
            CookieCollection cookies = response.Cookies;
            container.Add(cookies);
        }
    }
}
And its usage for me is as follows:
CookieContainer cookieJar = new CookieContainer();
HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(_urlForLoginPage);
req.CookieContainer = cookieJar;
req.Method = "GET";
Uri uri;

// First send a request to the login page so that we can get the URL that we will be
// redirected to, which contains the proper querystring info we'll need.
using (HttpWebResponse response = (HttpWebResponse)req.GetResponse())
{
    uri = response.ResponseUri;
}

// The C# WebClient will not persist cookies by default. The WebClientEx class above does what we need here.
using (WebClientEx ex = new WebClientEx(cookieJar))
{
    var postData = string.Format("USER={0}&PASSWORD={1}&target={2}", _uname, _pword, _urlForDownload);
    var resp = ex.UploadString(uri, postData);

    // Note that useUnsafeHeaderParsing is set to true in app.config. The response from this URL
    // is not well-formed, so it was throwing an exception when parsed by the "strict" default method.
    ex.DownloadFile(_wirelessToWireline, string.Format(@"{0}\FILE1-{1}.TXT", _fileDownloadLocation, DateTime.Now.ToString("yyyyMMdd")));
    ex.DownloadFile(_wirelineToWireless, string.Format(@"{0}\FILE2-{1}.TXT", _fileDownloadLocation, DateTime.Now.ToString("yyyyMMdd")));
}
You’ll often hear of people struggling with the 401 response that comes back; it’s basically the server challenging you for credentials. In my case, I needed to send that initial request anyway to capture the querystring info appended to the redirect URL, so it worked out nicely. I then posted the data the application’s login form expects and downloaded my files.
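If you do get tripped up by that handshake, one way to see what's going on (a hypothetical sketch, not something I ended up needing) is to turn off automatic redirects and inspect the response headers yourself:

HttpWebRequest probe = (HttpWebRequest)WebRequest.Create(_urlForLoginPage);
probe.AllowAutoRedirect = false; // don't let .NET follow the redirect for us
probe.CookieContainer = cookieJar;

try
{
    using (HttpWebResponse probeResponse = (HttpWebResponse)probe.GetResponse())
    {
        // A 3xx response carries the target URL (and its querystring) in the Location header.
        string location = probeResponse.Headers["Location"];
    }
}
catch (WebException wex)
{
    // GetResponse() throws for non-success codes like 401, but the challenge is still readable here.
    HttpWebResponse errorResponse = wex.Response as HttpWebResponse;
    if (errorResponse != null)
    {
        string challenge = errorResponse.Headers["WWW-Authenticate"];
    }
}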
Also note that the server I was downloading from sent back responses in a form the .NET Framework doesn’t accept by default, so I had to set useUnsafeHeaderParsing to true. That was an acceptable risk for me, but make sure you understand what it means before doing the same.
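For reference, that app.config entry is the standard system.net setting and looks like this:

<configuration>
  <system.net>
    <settings>
      <!-- Relaxes response-header validation. Only do this if you trust the server despite its malformed responses. -->
      <httpWebRequest useUnsafeHeaderParsing="true" />
    </settings>
  </system.net>
</configuration>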
This took longer than I care to admit to implement, but once I found and understood the “cookie-aware” concept, it worked out pretty well.