Unauthorized intelligence systems are pretending to be Google bots

October 28, 2021, 4 AM

Bots and crawlers are the new competitive intelligence systems

Websites are developed daily for different purposes, and each website contains additional meta-information that is supposed to be private.

Modern competitive intelligence systems allow competitors to access information about websites by pretending to be Google indexing bots.

Image showing competitive intelligence and associated fields

Image source – pim.com

Table of Contents:

Competitive Intelligence Systems ‘bots’
- Why are the crawlers so concerning?
- Information copying from competitors
Competitive Intelligence Tools working
Website crawling scripts and code snippets
Methods to protect and hide the web application from competitive intelligence services ‘bots’
Conclusion

The bots have resulted in the extraction of personal data on an increased scale. This has raised serious privacy concerns, as well as the threat to information security.

Awareness of malicious and deceptive algorithms is necessary, especially among young people. An NBC article published in 2018 reported that Twitter bots are used to steal information and identities for profit.

Chatbots are available on hundreds of websites where customers require information about the products, delivery, and shipping. The chatbots are based on the application developed for interaction with the user.

According to researchers, current chatbots employ artificial intelligence to learn and to give better and quicker replies in future conversations.

Despite being the leader in internet search, Google has limitations: Googlebot does not interact with a website the way a normal user does, especially if the website depends on JavaScript.

To understand how bots compromise security or steal information, it is necessary to understand crawlers first.

As Marcin explains, crawlers, also known as spiders, are computer programs that reach a website's primary and secondary pages and then map and index its contents.

Why are the crawlers so concerning?

If a website owner wants a page to appear on a search engine, that search engine's crawlers must visit the site first.

The main issue that the companies are facing is that the various crawlers can generate random queries on any website, gather and extract valuable data, and collect the actual copies of the web page visited.

Some of these bots have collected sensitive personal data of the users and companies from the websites, such as exclusive packages, incentives offered to customers, and policies; ultimately, this information is provided to the competitors.

Copying information from competitors

According to Dmitry Ugnichenko from MegaIndex:

The information obtained using competitive intelligence systems helps competitors outrank a site in the search results.

Using competitive intelligence services gives a competitor access to valuable information and data from the website.

Moreover, it lets them shape their own marketing by copying other companies' business strategies and promotion techniques.

The initial process of competitive intelligence information gathering consists of the following steps:

Competitor search: MegaIndex is a platform that finds the closest competitors based on region and on how closely they relate to your chosen domain.

Furthermore, the platform helps analyze the competitor's website to assist in the most efficient decision-making strategy.

Image showing the process of competitor search and information copying

Image source — megaindex.com

MegaIndex provides its users with the functionality of searching for suitable competitors by keywords.

The data obtained from competitors, including site structure, title content, headings, external links, and snippets, can inform an efficient promotion strategy.

Competitive intelligence systems provide countless businesses and companies with the most efficient marketing and promotion strategies for the best solutions to various problems.

At the same time, they are raising significant concern among security experts about private information being accessed without authorization.

In the era of the information systems revolution, every aspect of business, commerce, and many other domains of life is handled by web applications.

Competitive intelligence systems, otherwise known as ‘bots’, help the developers understand and locate the errors and bugs in their programs.

Sometimes, it gets difficult to differentiate between a legitimate bot and a harmful bot.

Competitive Intelligence tools working

There are many web-based applications present online that help in carrying out competitive intelligence analysis.

The analysis of competitors' websites gives out important information about the changes occurring at every moment.

The information gathered by competitive intelligence analysis is useful for understanding the methods and ways the competitors are promoting their services and products.

There is a vast range of tools available on GitHub that allow users to carry out competitive analysis; the domains of such tools include:

Monitoring and alerting
SQL client
Data visualization
Integration
Modeling
Database modeler

One of the most popular and common search engine tools used for competitive intelligence information gathering is Google.

Finding a particular piece of information requires knowing the appropriate search operators and query keywords.

Image showing search operators for information gathering

Google offers a wide range of search results as competitive intelligence if you know what to look for.

With the right query, search operators can gather a vast amount of data and information about the competitors.

Google's advanced search option plays a critical role in carrying out competitive intelligence search and analysis.
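As a rough illustration of how such queries are composed, the sketch below builds a search URL from plain terms plus the `site:`, `intitle:`, and `filetype:` operators. The function name `build_search_url` and the domain `example.com` are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode


def build_search_url(terms, site=None, intitle=None, filetype=None):
    """Compose a Google search URL from plain terms and optional operators."""
    parts = [terms] if terms else []
    if site:
        parts.append(f"site:{site}")          # restrict results to one domain
    if intitle:
        parts.append(f'intitle:"{intitle}"')  # phrase must appear in the page title
    if filetype:
        parts.append(f"filetype:{filetype}")  # restrict results to one file type
    query = " ".join(parts)
    return "https://www.google.com/search?" + urlencode({"q": query})


# Hypothetical query: PDF documents mentioning pricing on a competitor's domain
print(build_search_url("pricing", site="example.com", filetype="pdf"))
```

Combining a domain restriction with a file type is one common way to surface documents that a site publishes but does not link prominently.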

Image showing advanced search options from Google

Image showing results from a Google advanced query search

As previously mentioned, various web applications offer a variety of options for gathering information from competitors.

Image showing competitive intelligence information gathered

The Crunchbase web application offers a wide range of information about competitors in their respective businesses and domains, gathered by internet crawlers.

Website crawling scripts and code snippets

Image attributes information gathering crawler code
Tag attribute crawler code
HREF attribute SEO crawler code
Python script for SEO crawler

Web crawlers can be launched as competitive intelligence agents using a variety of methods, scripts, and code snippets that gather information from any domain.

Some code snippets for information gathering are provided below.

Image attributes information gathering crawler code

function crawlPage($url) {
    $dom = new DOMDocument('1.0');

    // Load the HTML content into $dom; @ suppresses warnings from malformed markup
    @$dom->loadHTMLFile($url);

    // Select all image (img) elements
    $nodes = $dom->getElementsByTagName('img');

    foreach ($nodes as $element) {
        $src = $element->getAttribute('src');
        $alt = $element->getAttribute('alt');
        $height = $element->getAttribute('height');
        $width = $element->getAttribute('width');

        // Output an img tag rebuilt from the extracted attributes
        echo '<img src="'.$src.'" alt="'.$alt.'" height="'.$height.'" width="'.$width.'"/>';
    }
}


The script above reports information about the images posted on the target website. Image attributes such as source, alt, height, and width can be easily extracted by running the script.

Tag attribute crawler code

$link_arr = array();
$tags_arr = array();

// Comma-separated list of URLs to crawl, taken from the request
$link = $_REQUEST['links'];
$str_arr = explode(",", $link);
foreach ($str_arr as $ele) {
    array_push($link_arr, trim($ele));
}

// Tag names selected in the request (an array of values)
$tags = $_REQUEST['checked_tag'];
foreach ($tags as $checked_tag) {
    array_push($tags_arr, trim($checked_tag));
}

function crawlPage($urls, $tagnames) {
    foreach ($urls as $linkname) {
        echo "Crawling for: ".htmlspecialchars($linkname)."<br>";

        // Load the HTML content; @ suppresses warnings from malformed markup
        $dom = new DOMDocument;
        @$dom->loadHTMLFile($linkname);

        foreach ($tagnames as $tagname) {
            $node = $dom->getElementsByTagName($tagname);
            echo htmlspecialchars($tagname)."<br>";

            // Number of matching elements; canonical links are counted separately below
            $count = $node->length;

            // Extract attributes from each element
            if ($tagname == 'meta') {
                foreach ($node as $element) {
                    // Use the name attribute when present, otherwise fall back to property
                    $key = $element->getAttribute('name') ?: $element->getAttribute('property');
                    echo htmlspecialchars($key.":".$element->getAttribute('content'))."<br>";
                }
            } elseif ($tagname == 'a') {
                foreach ($node as $element) {
                    $href = $element->getAttribute('href');
                    $content = $element->nodeValue;
                    echo htmlspecialchars('<a href="'.$href.'">'.$content.'</a>', ENT_QUOTES)."<br>";
                }
            } elseif ($tagname == 'img') {
                foreach ($node as $element) {
                    $src = $element->getAttribute('src');
                    $alt = $element->getAttribute('alt');
                    $data_src = $element->getAttribute('data-src');
                    echo htmlspecialchars('<img src="'.$src.'" data-src="'.$data_src.'" alt="'.$alt.'">', ENT_QUOTES)."<br>";
                }
            } elseif ($tagname == 'link') {
                // Only canonical links are reported and counted
                $count = 0;
                foreach ($node as $element) {
                    if ($element->getAttribute('rel') == 'canonical') {
                        $count++;
                        echo 'Canonical: '.htmlspecialchars($element->getAttribute('href')).'<br>';
                    }
                }
            } else {
                foreach ($node as $element) {
                    echo htmlspecialchars('<'.$tagname.'>'.$element->nodeValue.'</'.$tagname.'>')."<br>";
                }
            }

            echo "Total Count: ".$count."<br>";
        }
    }
}

crawlPage($link_arr, $tags_arr);

The script above reports, for each target URL, the attributes and contents of the selected tags, along with a count of the matching elements.

HREF attribute SEO crawler code

function crawlPage($url) {
    $dom = new DOMDocument('1.0');

    // Load the HTML content into $dom; @ suppresses warnings from malformed markup
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');

    // Extract the href attribute from each anchor element
    foreach ($anchors as $element) {
        $atr = $element->getAttribute('href');
        echo htmlspecialchars($atr).'<br>';
    }
}

The ‘href’ attribute of each anchor element can be easily extracted by running the above script.
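For comparison, the same href extraction can be sketched with Python's standard library alone. This is a minimal illustration, not a replacement for the PHP script: the class name `HrefCollector`, the helper `extract_hrefs`, and the sample markup are hypothetical, and the sketch parses an HTML string rather than fetching a URL.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every anchor tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    parser = HrefCollector()
    parser.feed(html)
    return parser.hrefs


# Hypothetical sample page; the third anchor has no href and is skipped
sample = ('<p><a href="/pricing">Pricing</a> '
          '<a href="https://example.com/blog">Blog</a> '
          '<a name="top">Top</a></p>')
print(extract_hrefs(sample))
```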

Python script for SEO crawler

import scrapy

class SEJSpider(scrapy.Spider):
    name = 'sejspider'
    start_urls = ['https://www.searchenginejournal.com/']
    limit = 3  # Only fetch 3 pages for testing

    def parse(self, response):
        # First, grab all the article blocks from the latest posts section on each page
        for title in response.css('#posts-tab-1 > article'):
            # From each article block, grab its title from the anchor text of its link
            yield {'title': title.css('h2 > a ::text').get()}

        # After grabbing all article titles, grab the link to the next page
        for next_page in response.css('a.next'):
            self.limit = self.limit - 1
            # Stop after fetching limit pages
            if self.limit < 0:
                break
            yield response.follow(next_page, self.parse)

The first loop grabs the article blocks from the latest posts section, while the second loop follows the link to the next page until the page limit is reached.

Methods to protect and hide the web application from competitive intelligence services ‘bots’

Program script for blocking access to some ‘bots’ at server-level
Steps to restrict access to a website by competitive intelligence systems.
Steps for reverse DNS lookup

Various methods are being developed each day to counter the unauthorized access to certain elements in web applications to protect private content.

The least effective method used for this purpose is restricting access through the robots.txt file.

For 25 years, the Robots Exclusion Protocol (REP) has served as a critical tool for search engine optimizers (SEOs).

This protocol restricts bots' access to specified content on the website. Implementing its directives in the robots.txt file helps reduce the overall flow of bots.

The result is reduced bot traffic on the site.
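To make the advisory nature of these directives concrete, the sketch below uses Python's `urllib.robotparser` to evaluate a small, hypothetical robots.txt. The rules, the `is_allowed` helper, and the `example.com` URLs are assumptions for illustration. The check runs on the crawler's side, so a bot that never performs it is not restricted in any way.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named bot entirely, and /private/ for everyone
ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("AhrefsBot", "https://example.com/"))              # blocked by name
print(is_allowed("Googlebot", "https://example.com/private/page"))  # blocked by path
print(is_allowed("Googlebot", "https://example.com/index.html"))    # permitted
```

A well-behaved crawler calls something like `is_allowed` before each request; a competitive intelligence bot can simply skip the check.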

Program script for blocking access to some ‘bots’ at server-level

The instruction prohibits bots from parsing sites at the server level. It is not advisory in nature like robots.txt, but a server ban on processing requests from bots. Add to .htaccess:

Options FollowSymLinks ExecCGI
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^[^4]* /404 [L,S=4000]

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ".*AhrefsBot.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ".*MJ12bot.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ".*RogerBot.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ".*MegaIndex\.ru/2\.0.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ".*YandexBot.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ".*Ia_archiver.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ".*Bingbot.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ".*Baiduspider.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ".*Archive\.org_bot.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ".*BLEXBot.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ".*LinkpadBot.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ".*Spbot.*" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ".*Serpstatbot.*" [NC]
RewriteRule ".*" "-" [F]

Script source -- https://indexoid.com/#backlinks

Although the directives in the robots.txt file can help reduce traffic and save the website's bandwidth, for most of its history the protocol was not formalized as an official standard, and compliance has always been voluntary.

Given the pace of technological change, the directives in the robots.txt file do not guarantee protection against CI systems.

To counter this problem, an efficient approach is to restrict the access of CI systems and other unwanted programs at the server level.

Server-level blocking scripts deny access based on the User-Agent value that accompanies each request and is recorded in the visit logs.
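The matching logic behind such a server-level rule can be sketched in a few lines. This is an illustration under stated assumptions, not the server's actual implementation: the blocklist is a shortened subset of the bot names from the .htaccess listing, and `response_status` is a hypothetical helper that returns the HTTP status a request would receive.

```python
import re

# Shortened, illustrative subset of the bot names in the .htaccess rules
BLOCKED_AGENTS = ["AhrefsBot", "MJ12bot", "RogerBot", "BLEXBot", "Serpstatbot"]

# One case-insensitive pattern that matches any blocked name inside the header
BLOCKED_PATTERN = re.compile(
    "|".join(re.escape(name) for name in BLOCKED_AGENTS), re.IGNORECASE
)


def response_status(user_agent):
    """Return 403 for a blocked User-Agent header and 200 otherwise."""
    if user_agent and BLOCKED_PATTERN.search(user_agent):
        return 403
    return 200


print(response_status("Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"))
print(response_status("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"))
```

Because the User-Agent header is supplied by the client, a bot can set it to any value, including that of a search engine crawler. That is why matching on the header alone is not sufficient and the IP address also has to be checked.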

Steps to restrict access to a website by competitive intelligence systems

The CI system accesses the website, and the IP address of the request is analyzed.
The IP address can be used to identify the host.
Because the domain names of well-known, reliable crawlers are known, a hostname that does not match the domain of an allowed crawler identifies the request as unauthorized access by a bot that does not belong to the search engine.
A reverse DNS query resolves the IP address of the CI system to a hostname, which identifies the source of the competitive intelligence crawler.

Steps for reverse DNS lookup

A reverse DNS lookup is straightforward, but blocking requires a separate script, which can be implemented in any language.

Open Command Prompt by typing ‘cmd’ in the search bar of your device.
Type in the following commands:
nslookup (IP address)
ping -a (IP address)
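The verification steps described above can be sketched as a script. This is a minimal illustration, assuming the crawler to be verified is Googlebot and that its hostnames end in `googlebot.com` or `google.com`. The function names are hypothetical, the lookups handle IPv4 only, and the resolver functions are passed in as parameters so the logic can be exercised without network access.

```python
import socket

# Assumed hostname suffixes for a genuine Googlebot; adjust for other crawlers
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip):
    """Resolve an IP address to a hostname, or None if no PTR record exists."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def forward_lookup(hostname):
    """Resolve a hostname to its list of IPv4 addresses (empty on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def is_verified_crawler(ip, reverse=reverse_lookup, forward=forward_lookup):
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    hostname = reverse(ip)
    if not hostname or not hostname.lower().endswith(ALLOWED_SUFFIXES):
        return False
    # The forward lookup must return the original IP; otherwise the PTR record is forged
    return ip in forward(hostname)
```

A request that claims to be Googlebot in its User-Agent but fails `is_verified_crawler` for its IP address can then be denied. The forward confirmation matters because the owner of an IP range controls its reverse DNS records and could otherwise point them at any hostname.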


Conclusion

Progress in the automation of intelligence gathering and data indexing has caused significant privacy and security concerns, as some applications violate users' privacy and expose personal data.

Unauthorized access by third-party competitive intelligence systems exposes critical information present on the website, which can have significant consequences for its owner.

We have proposed various approaches to counter this problem with explanations in detail.
