SEO
Is ChatGPT's Use Of Web Content Fair?

Large Language Models (LLMs) like ChatGPT train on multiple sources of information, including web content. This data becomes the basis for summaries of that content in the form of articles that are produced without attribution or benefit to those who published the original content used to train ChatGPT.
Search engines download website content (a process called crawling and indexing) to provide answers in the form of links back to the websites.
Website publishers have the ability to opt-out of having their content crawled and indexed by search engines through the Robots Exclusion Protocol, commonly referred to as Robots.txt.
The Robots Exclusion Protocol is not an official Internet standard, but it's one that legitimate web crawlers obey.
Should web publishers be able to use the Robots.txt protocol to prevent large language models from using their website content?
Large Language Models Use Website Content Without Attribution
Some who are involved with search marketing are uncomfortable with how website data is used to train machines without giving anything back, like an acknowledgement or traffic.
Hans Petter Blindheim (LinkedIn profile), Senior Expert at Curamando, shared his opinions with me.
Hans commented:
“When an author writes something after having learned something from an article on your site, they will more often than not link to your original work because it offers credibility and as a professional courtesy.
It’s called a citation.
But the scale at which ChatGPT assimilates content and does not grant anything back differentiates it from both Google and people.
A website is generally created with a business directive in mind.
Google helps people find the content, providing traffic, which has a mutual benefit to it.
But it’s not like large language models asked your permission to use your content, they just use it in a broader sense than what was expected when your content was published.
And if the AI language models do not offer value in return – why should publishers allow them to crawl and use the content?
Does their use of your content meet the standards of fair use?
When ChatGPT and Google’s own ML/AI models trains on your content without permission, spins what it learns there and uses that while keeping people away from your websites – shouldn’t the industry and also lawmakers try to take back control over the Internet by forcing them to transition to an “opt-in” model?”
The concerns that Hans expresses are reasonable.
In light of how fast technology is evolving, should laws concerning fair use be reconsidered and updated?
I asked John Rizvi, a Registered Patent Attorney (LinkedIn profile) who is board certified in Intellectual Property Law, if Internet copyright laws are outdated.
John answered:
“Yes, without a doubt.
One major bone of contention in cases like this is the fact that the law inevitably evolves far more slowly than technology does.
In the 1800s, this maybe didn’t matter so much because advances were relatively slow and so legal machinery was more or less tooled to match.
Today, however, runaway technological advances have far outstripped the ability of the law to keep up.
There are simply too many advances and too many moving parts for the law to keep up.
As it is currently constituted and administered, largely by people who are hardly experts in the areas of technology we’re discussing here, the law is poorly equipped or structured to keep pace with technology…and we must consider that this isn’t an entirely bad thing.
So, in one regard, yes, Intellectual Property law does need to evolve if it even purports, let alone hopes, to keep pace with technological advances.
The primary problem is striking a balance between keeping up with the ways various forms of tech can be used while holding back from blatant overreach or outright censorship for political gain cloaked in benevolent intentions.
The law also has to take care not to legislate against possible uses of tech so broadly as to strangle any potential benefit that may derive from them.
You could easily run afoul of the First Amendment and any number of settled cases that circumscribe how, why, and to what degree intellectual property can be used and by whom.
And attempting to envision every conceivable usage of technology years or decades before the framework exists to make it viable or even possible would be an exceedingly dangerous fool’s errand.
In situations like this, the law really cannot help but be reactive to how technology is used…not necessarily how it was intended.
That’s not likely to change anytime soon, unless we hit a massive and unanticipated tech plateau that allows the law time to catch up to current events.”
So it appears that the issue of copyright law has many considerations to balance when it comes to how AI is trained; there is no simple answer.
OpenAI and Microsoft Sued
An interesting case that was recently filed is one in which OpenAI and Microsoft used open source code to create their Copilot product.
The problem with using open source code is that many open source licenses require attribution.
According to an article published in a scholarly journal:
“Plaintiffs allege that OpenAI and GitHub assembled and distributed a commercial product called Copilot to create generative code using publicly accessible code originally made available under various “open source”-style licenses, many of which include an attribution requirement.
As GitHub states, ‘…[t]rained on billions of lines of code, GitHub Copilot turns natural language prompts into coding suggestions across dozens of languages.’
The resulting product allegedly omitted any credit to the original creators.”
The author of that article, who is a legal expert on the subject of copyrights, wrote that many view open source Creative Commons licenses as a “free-for-all.”
Some may also consider the phrase free-for-all a fair description of the datasets composed of Internet content that is scraped and used to generate AI products like ChatGPT.
Background on LLMs and Datasets
Large language models train on multiple datasets of content. Datasets can consist of emails, books, government data, Wikipedia articles, and even datasets created from websites linked from posts on Reddit that have at least three upvotes.
Many of the datasets related to the content of the Internet have their origins in the crawl created by a non-profit organization called Common Crawl.
Their dataset, the Common Crawl dataset, is available free for download and use.
The Common Crawl dataset is the starting point for many other datasets that are created from it.
For example, GPT-3 used a filtered version of Common Crawl (Language Models are Few-Shot Learners PDF).
This is how GPT-3 researchers used the website data contained within the Common Crawl dataset:
“Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset… constituting nearly a trillion words.
This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice.
However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets.
Therefore, we took 3 steps to improve the average quality of our datasets:
(1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora,
(2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and
(3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.”
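One of those steps, fuzzy deduplication, is easier to picture with a small example. The sketch below is a simplified illustration of document-level near-duplicate removal using word shingles and Jaccard similarity; the paper does not publish OpenAI's actual implementation, so the function names, the shingle size, and the 0.8 threshold are illustrative assumptions only.

def shingles(text, n=5):
    # Break a document into overlapping n-word "shingles".
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    # Jaccard similarity between two shingle sets (0 = disjoint, 1 = identical).
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def fuzzy_dedupe(documents, threshold=0.8):
    # Keep a document only if it is not a near-duplicate of one already kept.
    kept, kept_shingles = [], []
    for doc in documents:
        s = shingles(doc)
        if all(jaccard(s, ks) < threshold for ks in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog near the river bank today.",
    "The quick brown fox jumps over the lazy dog near the river bank today!",
    "A completely different article about search engines and web crawling.",
]
print(len(fuzzy_dedupe(corpus)))  # prints 2: the near-duplicate is dropped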
Google’s C4 dataset (Colossal Clean Crawled Corpus), which was used to create the Text-to-Text Transfer Transformer (T5), has its roots in the Common Crawl dataset, too.
Their research paper (Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer PDF) explains:
“Before presenting the results from our large-scale empirical study, we review the necessary background topics required to understand our results, including the Transformer model architecture and the downstream tasks we evaluate on.
We also introduce our approach for treating every problem as a text-to-text task and describe our “Colossal Clean Crawled Corpus” (C4), the Common Crawl-based data set we created as a source of unlabeled text data.
We refer to our model and framework as the ‘Text-to-Text Transfer Transformer’ (T5).”
Google published an article on their AI blog that further explains how Common Crawl data (which contains content scraped from the Internet) was used to create C4.
They wrote:
“An important ingredient for transfer learning is the unlabeled dataset used for pre-training.
To accurately measure the effect of scaling up the amount of pre-training, one needs a dataset that is not only high quality and diverse, but also massive.
Existing pre-training datasets don’t meet all three of these criteria — for example, text from Wikipedia is high quality, but uniform in style and relatively small for our purposes, while the Common Crawl web scrapes are enormous and highly diverse, but fairly low quality.
To satisfy these requirements, we developed the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl that is two orders of magnitude larger than Wikipedia.
Our cleaning process involved deduplication, discarding incomplete sentences, and removing offensive or noisy content.
This filtering led to better results on downstream tasks, while the additional size allowed the model size to increase without overfitting during pre-training.”
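To make that cleaning step more concrete, here is a minimal sketch of the kind of line-level heuristics described above (discarding incomplete sentences and dropping noisy content). It is not Google's C4 pipeline; the word-count threshold and the list of noise markers are hypothetical stand-ins.

BOILERPLATE_MARKERS = ("lorem ipsum", "cookie policy", "javascript required")

def clean_page(text, min_words=5):
    # Keep only lines that look like complete, non-boilerplate sentences.
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if len(line.split()) < min_words:
            continue  # too short to be a complete sentence
        if not line.endswith((".", "!", "?", '"')):
            continue  # discard lines that do not end in terminal punctuation
        if any(marker in line.lower() for marker in BOILERPLATE_MARKERS):
            continue  # drop obvious boilerplate or noise
        kept.append(line)
    return "\n".join(kept)

page = """Accept our cookie policy to keep browsing.
This sentence is complete and informative enough to survive the filter.
Menu Home About Contact"""
print(clean_page(page))  # only the middle line remains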
Google, OpenAI, and even Oracle's Open Data are using Internet content, your content, to create datasets that are then used to create AI applications like ChatGPT.
Common Crawl Can Be Blocked
It is possible to block Common Crawl and subsequently opt-out of all the datasets that are based on Common Crawl.
But if a site has already been crawled, then that website data is already in the datasets. There is no way to remove your content from the Common Crawl dataset or from derivative datasets like C4.
Using the Robots.txt protocol will only block future crawls by Common Crawl; it won't stop researchers from using content already in the dataset.
How to Block Common Crawl From Your Data
Blocking Common Crawl is possible through the use of the Robots.txt protocol, within the limitations discussed above.
The Common Crawl bot is called CCBot.
It is identified using the most up-to-date CCBot User-Agent string: CCBot/2.0
Blocking CCBot with Robots.txt is accomplished the same as with any other bot.
Here is the code for blocking CCBot with Robots.txt.
User-agent: CCBot
Disallow: /
CCBot crawls from Amazon AWS IP addresses.
CCBot also follows the nofollow Robots meta tag:
<meta name="robots" content="nofollow">
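If you want to confirm that your robots.txt actually disallows CCBot, a quick check with Python's standard-library robots.txt parser looks something like the sketch below; example.com is a placeholder for your own domain.

from urllib.robotparser import RobotFileParser

# "example.com" is a placeholder; point this at your own robots.txt.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()  # download and parse the live robots.txt

if parser.can_fetch("CCBot", "https://example.com/"):
    print("CCBot is allowed to crawl this site.")
else:
    print("CCBot is blocked by robots.txt.")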
What If You’re Not Blocking Common Crawl?
Web content can be downloaded without permission; that is how browsers work, they download content.
Neither Google nor anyone else needs permission to download and use content that is published publicly.
Website Publishers Have Limited Options
The consideration of whether it is ethical to train AI on web content doesn’t seem to be a part of any conversation about the ethics of how AI technology is developed.
It seems to be taken for granted that Internet content can be downloaded, summarized and transformed into a product called ChatGPT.
Does that seem fair? The answer is complicated.
Featured Image: Shutterstock/Krakenimages.com
SEO
WordPress WooCommerce Payments Plugin Vulnerability

Automattic, publishers of the WooCommerce plugin, announced the discovery and patch of a critical vulnerability in the WooCommerce Payments plugin.
The vulnerability allows an attacker to gain Administrator level credentials and perform a full site-takeover.
Administrator is the highest permission user role in WordPress, granting full access to a WordPress site with the ability to create more admin-level accounts as well as the ability to delete the entire website.
What makes this particular vulnerability of great concern is that it's exploitable by unauthenticated attackers, which means they don't first have to acquire another permission in order to manipulate the site and obtain an admin-level user role.
WordPress security plugin maker Wordfence described this vulnerability:
“After reviewing the update we determined that it removed vulnerable code that could allow an unauthenticated attacker to impersonate an administrator and completely take over a website without any user interaction or social engineering required.”
The Sucuri Website security platform published a warning about the vulnerability that goes into further details.
Sucuri explains that the vulnerability appears to be in the following file:
/wp-content/plugins/woocommerce-payments/includes/platform-checkout/class-platform-checkout-session.php
They also explained that the “fix” implemented by Automattic is to remove the file.
Sucuri observes:
“According to the plugin change history it appears that the file and its functionality was simply removed altogether…”
The WooCommerce website published an advisory that explains why they chose to completely remove the affected file:
“Because this vulnerability also had the potential to impact WooPay, a new payment checkout service in beta testing, we have temporarily disabled the beta program.”
The WooCommerce Payment Plugin vulnerability was discovered on March 22, 2023 by a third party security researcher who notified Automattic.
Automattic swiftly issued a patch.
Details of the vulnerability will be released on April 6, 2023.
That means any site that has not updated the plugin by then will be exposed once those details become public.
What Version Of The WooCommerce Payments Plugin Is Vulnerable?
WooCommerce updated the plugin to version 5.6.2. This is considered the most up-to-date and non-vulnerable version of the plugin.
Automattic has pushed a forced update; however, it's possible that some sites may not have received it.
It is recommended that all users of the affected plugin check that their installations are updated to WooCommerce Payments version 5.6.2.
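One rough way to check a live site from the outside is to read the "Stable tag" from the plugin's publicly accessible readme.txt, a common fingerprinting heuristic. The sketch below assumes the default plugin path and that the readme is not blocked; the detected version can lag behind the installed one, so treat the result as a hint and verify in the WordPress dashboard.

import re
import urllib.request

SITE = "https://example.com"  # placeholder; use your own site
README_URL = SITE + "/wp-content/plugins/woocommerce-payments/readme.txt"
PATCHED_VERSION = (5, 6, 2)

with urllib.request.urlopen(README_URL, timeout=10) as response:
    readme = response.read().decode("utf-8", errors="replace")

match = re.search(r"Stable tag:\s*([\d.]+)", readme)
if match:
    detected = tuple(int(part) for part in match.group(1).strip(".").split("."))
    status = "patched" if detected >= PATCHED_VERSION else "likely vulnerable - update now"
    print(f"Detected WooCommerce Payments {match.group(1)}: {status}")
else:
    print("Could not read a version from readme.txt; check the WordPress dashboard instead.")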
Once the vulnerability is patched, WooCommerce recommends taking the following actions:
“Once you’re running a secure version, we recommend checking for any unexpected admin users or posts on your site. If you find any evidence of unexpected activity, we suggest:
Updating the passwords for any Admin users on your site, especially if they reuse the same passwords on multiple websites.
Rotating any Payment Gateway and WooCommerce API keys used on your site. Here’s how to update your WooCommerce API keys. For resetting other keys, please consult the documentation for those specific plugins or services.”
Read the WooCommerce vulnerability explainer:
Critical Vulnerability Patched in WooCommerce Payments – What You Need to Know
SEO
How Do You Clean Up Content Without Affecting Rankings?

Today’s Ask An SEO question comes from Neethu, who asks:
My website is almost 20 years old. There are lots of content. Many of them are not performing well. How do you effectively clean up those content without effecting rankings?
Contrary to what some SEO pros tell you, more content is not always better.
Deciding what content to keep, which content to modify, and which content to throw away is an important consideration, as content is the backbone of any website and is essential for driving traffic, engagement, and conversions.
However, not all content is created equal, and outdated, irrelevant, or underperforming content can hinder a website’s success.
Run A Content Audit
To effectively clean up your website’s content, the first step is to conduct a content audit.
This involves analyzing your site’s content and assessing its performance, relevance, and quality.
You can use various metrics such as traffic, bounce rate, and engagement to identify which pages are performing well and which ones are not.
Once you have identified the pages that are not performing well, it’s important to prioritize them based on their importance to your website.
Pages that are not driving traffic or conversions may need to be prioritized over pages that are not performing well but are still important for your site’s overall goals.
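If you have an analytics export handy, this triage step can be as simple as flagging every URL that falls below a traffic threshold or above a bounce-rate threshold. The file name, column names, and thresholds in the sketch below are hypothetical; adjust them to match whatever your analytics tool actually exports.

import csv

MIN_SESSIONS = 50       # below this monthly traffic, a page is an audit candidate
MAX_BOUNCE_RATE = 0.90  # above this bounce rate, a page is an audit candidate

candidates = []
with open("content_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        sessions = int(row["sessions"])
        bounce_rate = float(row["bounce_rate"])
        if sessions < MIN_SESSIONS or bounce_rate > MAX_BOUNCE_RATE:
            candidates.append((sessions, bounce_rate, row["url"]))

# Review the weakest pages first.
for sessions, bounce_rate, url in sorted(candidates):
    print(f"{url}\t{sessions} sessions\t{bounce_rate:.0%} bounce rate")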
Distinguish Evergreen Vs. Time-Sensitive Content
Additionally, it’s important to consider whether a page is evergreen or time-sensitive.
You can update or repurpose evergreen content over time, while you may need to remove time-sensitive content.
After prioritizing your content, you can decide what action to take with each page.
For pages that are still relevant but not performing well, you may be able to update them with fresh information to improve their performance.
For pages that are outdated or no longer relevant, it may be best to remove them altogether.
When removing content, implement 301 redirects to relevant pages to ensure that any backlinks pointing to the old page are not lost.
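Once the redirects are in place, it is worth verifying that each removed URL really returns a 301 to the page you intended. The sketch below does that with Python's standard library; the URL map is a hypothetical example.

import urllib.error
import urllib.request

# Map of removed URLs to the pages they should now redirect to (hypothetical).
REDIRECT_MAP = {
    "https://example.com/old-guide-2009/": "https://example.com/current-guide/",
}

class NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, *args, **kwargs):
        return None  # do not follow redirects; we want to inspect the first response

opener = urllib.request.build_opener(NoRedirect)

for old_url, expected_target in REDIRECT_MAP.items():
    try:
        response = opener.open(old_url, timeout=10)
        print(f"{old_url}: got {response.status}, expected a 301 redirect")
    except urllib.error.HTTPError as err:
        location = err.headers.get("Location", "")
        ok = err.code == 301 and location == expected_target
        print(f"{old_url}: {err.code} -> {location or '(no Location header)'} [{'OK' if ok else 'CHECK'}]")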
Monitor Your Stuff
It’s important to monitor your search engine rankings after cleaning up your content to ensure your changes do not negatively impact your SEO.
But don’t just look at rankings.
Content optimization projects can affect traffic, conversions, navigation, and other items that impact your overall search engine optimization efforts.
Watch Google Analytics closely. If there are traffic declines, you may need to re-evaluate a few changes.
It’s important not to have a knee-jerk reaction, however.
Before you throw out your optimization efforts, be sure that the changes you made are actually what is causing the drop – and make sure those changes are stable within the search engines' index.
Remember that it may take some time for your rankings to stabilize after a content cleanup, so it’s important to be patient and monitor your website’s performance over time.
To further optimize your content cleanup, consider using Google Search Console to identify pages with high impressions but low click-through rates.
These pages may benefit from content updates or optimization to improve their performance.
Additionally, consolidating pages that cover similar topics into one comprehensive page can improve user experience and help avoid keyword cannibalization.
In Summary
Cleaning up your website’s content is crucial for maintaining a high-quality site.
By conducting a content audit, prioritizing your content, and deciding whether to keep, update, or remove the content, you can effectively clean up your site without negatively impacting your rankings.
Remember to monitor your rankings and be patient as your site adjusts.
Featured Image: Song_about_summer/Shutterstock
SEO
Optimize Your SEO Strategy For Maximum ROI With These 5 Tips

Wondering what improvements you can make to boost organic search results and increase ROI?
If you want to be successful in SEO, even after large Google algorithm updates, be sure to:
- Keep the SEO fundamentals at the forefront of your strategy.
- Prioritize your SEO efforts for the most rewarding outcomes.
- Focus on uncovering and prioritizing commercial opportunities if you’re in ecommerce.
- Dive into seasonal trends and how to plan for them.
- Get tip 5 and all of the step-by-step how-tos by joining our upcoming webinar.
We’ll share five actionable ways you can discover the most impactful opportunities for your business and achieve maximum ROI.
You’ll learn how to:
- Identify seasonal trends and plan for them.
- Report on and optimize your online share of voice.
- Maximize SERP feature opportunities, most notably popular products.
Join Jon Earnshaw, Chief Product Evangelist and co-founder of Pi Datametrics, and Sophie Moule, Head of Product and Marketing at Pi Datametrics, as they walk you through ways to drastically improve the ROI of your SEO strategy.
In this live session, we'll explore innovative ways you can boost your search strategy and outperform your competitors.
Ready to start maximizing your results and growing your business?
Sign up now and get the actionable insights you need to succeed with SEO.
Can't attend the live webinar? We've got you covered. Register anyway and you'll get access to a recording after the event.