How to protect content against scrapers with headless browsers?

We live in the age of information. Data can bring huge value to a business, which is why so many scrapers exist. I guess every website owner has thought at least once about how to protect their content. One way is to render content with JavaScript. However, modern tools can run a headless browser on a server and scrape all the required data anyway. Let's make the scraper creator's life a bit harder…

I want to show you an example of how to protect content from scrapers that use headless browsers to render pages. First, though, I need to cover some theory. Feel free to skip the next section if you already know what Shadow Root is.

Shadow Root

Let me introduce you to Shadow Root. If you don't know what it is, I'll try to explain. Shadow Root lets you create a separate DOM inside your current DOM. It's kind of a document inside a document. Why could that be useful? For example, you can build a web component with its own HTML IDs and CSS classes without worrying about clashes with other code on the page. If you want to learn more, try reading the documentation and this article.

Protection

So, how can we use shadow root to protect content from scrapers? Shadow Root has two modes: open and closed. If the mode is closed, then you can't reach the elements inside it from outside JavaScript through element.shadowRoot. That blocks the usual way of scraping content from those elements by evaluating JS inside the browser.

I'll show you an example. Let's create an empty HTML document with a div for sensitive data.

<strong>Following content should be protected:</strong>
<div id="protected-content">
</div>

We need JavaScript, so let's add a script block after our div with the following test code. I use shadow root in open mode for now.

<strong>Following content should be protected:</strong>
<div id="protected-content">
</div>
<script>
    const element = document.querySelector('#protected-content');
    element.attachShadow({ mode: 'open' });
    element.shadowRoot.innerHTML = '<p>Hello from protected area, buddy!</p>';
</script>

What did we get as a result?

Looks like our shadow root code is working. However, it doesn't give any protection yet: anyone can still read the content through element.shadowRoot. Let's switch the mode to closed.

element.attachShadow({ mode: 'closed' });

Oops... Looks like my code doesn't work now.

Shadow root is not accessible if mode is closed.

As I said before, a closed shadow root can't be accessed from JavaScript. That means element.shadowRoot will always be null if the mode is closed, so the line that sets element.shadowRoot.innerHTML throws an error.
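To make the difference between the two modes concrete, here is a minimal sketch of what a scraper might evaluate inside the page. The helper name readShadowText is my own invention for illustration, not part of any library, and the selector is just the one used in the example above.

```javascript
// Hypothetical helper a scraper could evaluate in the page context.
// `host` is any element that may or may not carry a shadow root.
function readShadowText(host) {
    if (!host) {
        return null; // the element was not found at all
    }
    // For a closed shadow root (or no shadow root) this property is null.
    const root = host.shadowRoot;
    if (root === null || root === undefined) {
        return null;
    }
    return root.textContent;
}

// Only runs in a browser; skipped where there is no DOM.
if (typeof document !== 'undefined') {
    console.log(readShadowText(document.querySelector('#protected-content')));
}
```

With mode 'open' the helper returns the text inside the shadow root. With mode 'closed' it returns null, which is the same null that broke my own script.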

How can we fix that? Let's create a custom HTML element. First, I removed the div with id protected-content and the old script block. Then I added a new script block inside the head with the following code. It declares the ProtectedWebComponent class with a constructor and a connectedCallback method. You can find similar examples on the internet.

class ProtectedWebComponent extends HTMLElement {
    constructor() {
        super();
        this._protected_root = this.attachShadow({ mode: "closed" });
    }
    connectedCallback() {
        this._protected_root.innerHTML = "<p>Hello from protected area, buddy!</p>";
    }
}
window.customElements.define("protected-web-component", ProtectedWebComponent);

Then I added a new component to HTML.

<strong>Following content should be protected:</strong>
<protected-web-component></protected-web-component>

Does it work? Yeah, looks good.

Closed shadow root example. Scrapers can't access it.

However, there is still a problem. You can't get the shadow root by calling something like element.shadowRoot, but the _protected_root field is still there, so anyone can get the shadow root through element._protected_root. How could we fix that? Let's fill in the content inside the constructor and not store the shadow root reference anywhere.

class ProtectedWebComponent extends HTMLElement {
    constructor() {
        super();
        const shadowRootLink = this.attachShadow({ mode: "closed" });
        shadowRootLink.innerHTML = "<p>Hello from protected area, buddy!</p>";
    }
}
window.customElements.define("protected-web-component", ProtectedWebComponent);

Looks much better now.

It's impossible to access closed shadow root from JavaScript.
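To show why dropping the _protected_root field matters, here is a rough sketch of how a scraper could hunt for a shadow root stored on the element. The function name findLeakedShadowRoots is hypothetical, and the check is duck typing based on two things I'm sure of: a shadow root has nodeType 11 (document fragment) and its host property points back to the element.

```javascript
// Hypothetical scraper-side helper: scan an element's own properties
// for anything that looks like a shadow root attached to that element.
function findLeakedShadowRoots(element) {
    const DOCUMENT_FRAGMENT_NODE = 11; // nodeType reported by a ShadowRoot
    const leaks = [];
    for (const name of Object.getOwnPropertyNames(element)) {
        const value = element[name];
        const looksLikeShadowRoot =
            value !== null &&
            typeof value === 'object' &&
            value.nodeType === DOCUMENT_FRAGMENT_NODE &&
            value.host === element;
        if (looksLikeShadowRoot) {
            leaks.push(name); // remember which field exposes the root
        }
    }
    return leaks;
}

// Only runs in a browser; skipped where there is no DOM.
if (typeof document !== 'undefined') {
    const host = document.querySelector('protected-web-component');
    if (host) {
        console.log(findLeakedShadowRoots(host));
    }
}
```

Against the first version of the component this would report the _protected_root field. Against the second version it finds nothing, because the reference lives only in a local variable inside the constructor.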

Source code

If you want to check the source code, you're welcome to visit my GitHub repository.

Results

As you can see, it's pretty easy to protect some content using shadow root. However, you should understand that this approach is not a magic wand. Using shadow root to protect content can make a web scraper developer's life harder, but it doesn't put the content in bulletproof armor. If you are interested in how to get around this protection, write a comment below, and I'll try to explain.

By the way, I have an article on how to download Vimeo videos using Python. Maybe you'll find it interesting.

Jakeroid's BLOG | Ivan Karabadzhak