ScrapperMin, a Web Automation Language

Important Headers when Doing Scraping

John Kenedy 2017-05-10

When scraping, for example collecting links and then downloading from them, you will find that downloads are sometimes served via GET and sometimes via POST, depending on the website. Some websites serve downloads using POST so that you do not accidentally download the same file twice, since a browser reopens its last browsed page with GET when it is relaunched. For this reason POST is used, sometimes secured with extra tokens sent as POST parameters, for the download to succeed. Some websites also set an expiry on a download link, whether it is served via POST or GET.
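The difference between the two download styles can be sketched with Python's `requests` library. The URLs, the `file_id`, and the `download_token` parameter name below are hypothetical; a real site's token would be copied from the page that links to the download.

```python
import requests

# A GET-style download: the file URL can be fetched directly.
get_req = requests.Request(
    "GET", "https://example.com/files/report.pdf"  # hypothetical URL
).prepare()

# A POST-style download: the server expects extra token parameters so the
# same file is not fetched twice by accident (e.g. on browser relaunch).
post_req = requests.Request(
    "POST",
    "https://example.com/download",  # hypothetical URL
    data={
        "file_id": "1234",            # hypothetical parameter
        "download_token": "abc123",   # token scraped from the linking page
    },
).prepare()

print(get_req.method)   # GET
print(post_req.method)  # POST
# requests.Session().send(post_req) would perform the actual download.
```

Because the token usually expires, it must be scraped fresh from the linking page on every run rather than hard-coded.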

Something to Note

  1. Referer header
    The Referer header matters when a website wants to make sure you followed the link from its own pages rather than from somewhere else. Some sites check this header, which the browser sends to the next page when the previous page and the next page are within the same domain.
  2. X-Requested-With header
    The X-Requested-With header is important when the link you are accessing is meant to be accessed by AJAX calls. AJAX is a way for a website to load a page in the background without refreshing the webpage, then update the current page's DOM according to the AJAX response. This header matters when you access a link that the browser normally fetches through JavaScript asynchronously, that is, via AJAX.
  3. User-Agent
    While less important, the User-Agent header must be consistent between requests until you finally reach the last link you need. Some websites check whether the initial request and subsequent requests carry the same User-Agent. This check may be done because different browsers behave differently, and is not necessarily a security feature.
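The three headers above can be set together on a single `requests` session, which is a convenient way to keep the User-Agent consistent across every request. The site URLs and header values below are hypothetical, typical browser values:

```python
import requests

session = requests.Session()

# 3. User-Agent: set once on the session so every request carries
#    the same value, as a real browser would.
session.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# 1. Referer: pretend we followed the link from the site's own listing page.
# 2. X-Requested-With: mark the call as AJAX, as the site's JavaScript would.
req = requests.Request(
    "GET",
    "https://example.com/ajax/get_link?id=42",  # hypothetical AJAX endpoint
    headers={
        "Referer": "https://example.com/listing",
        "X-Requested-With": "XMLHttpRequest",
    },
)
prepared = session.prepare_request(req)

print(prepared.headers["User-Agent"])
print(prepared.headers["Referer"])
# session.send(prepared) would issue the request.
```

Session-level headers merge with per-request headers, so each call only needs to add the headers that change (Referer) while the constant ones (User-Agent) stay identical for the whole session.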

When scraping, it is important to know that you have sent the correct HTTP headers. The best way to verify this is to perform the same actions in a real browser and capture the requests with software such as Fiddler, or with each browser's built-in Developer Tools. Some HTTP requests cannot be captured by the built-in Developer Tools when the page is accessed through a WebSocket: the socket operates at the lower TCP level and sends what amounts to an HTTP exchange, producing a normal web access, but because the browser does not acknowledge such access as an HTTP request, the Network section of Developer Tools cannot capture it. In that case the best way to capture the request is with software like Fiddler.
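Once a request has been captured, a common workflow is to copy its headers verbatim into the scraper and replay them. A minimal sketch, assuming the header values were taken from a Fiddler or Network-tab capture (all values below are placeholders):

```python
import requests

# Headers copied from a captured browser request; every value here is a
# placeholder standing in for what the capture actually showed.
captured_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://example.com/page",
    "Accept": "text/html,application/xhtml+xml",
    "Cookie": "sessionid=PLACEHOLDER",
}

req = requests.Request(
    "GET",
    "https://example.com/download?id=42",  # hypothetical captured URL
    headers=captured_headers,
).prepare()

# Print the prepared request's headers to compare them line by line
# against the capture before actually sending anything.
for name, value in req.headers.items():
    print(f"{name}: {value}")
```

Diffing the printed headers against the capture catches missing or mistyped headers before the server ever sees the request.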

Other headers, such as Content-Type or encoding-related headers, are also important but not covered here because they are common. By default, emulating the browser's content type and sending data in a similar way is enough for the server to understand the data you are passing.
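In practice an HTTP library often sets Content-Type for you when you send data the same way the browser does. With Python's `requests`, for instance, passing `data=` produces a form-encoded body and `json=` a JSON one, each with the matching Content-Type (the endpoint URL below is hypothetical):

```python
import requests

# data= mimics a browser form submission: the body is form-encoded and
# Content-Type is set to application/x-www-form-urlencoded automatically.
form_req = requests.Request(
    "POST", "https://example.com/api", data={"q": "test"}
).prepare()

# json= mimics an AJAX call posting JSON: Content-Type becomes
# application/json automatically.
json_req = requests.Request(
    "POST", "https://example.com/api", json={"q": "test"}
).prepare()

print(form_req.headers["Content-Type"])  # application/x-www-form-urlencoded
print(json_req.headers["Content-Type"])  # application/json
```

If the capture shows the browser sending a different Content-Type, it can always be overridden explicitly via the `headers=` argument.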