
Is ChatGPT Use Of Web Content Fair?

Large Language Models (LLMs) like ChatGPT train on multiple sources of information, including web content. That data becomes the basis for summaries and article-style answers produced without attribution or benefit to those who published the original content used to train ChatGPT.

Search engines download website content (called crawling and indexing) to provide answers in the form of links to the websites.

Website publishers have the ability to opt-out of having their content crawled and indexed by search engines through the Robots Exclusion Protocol, commonly referred to as Robots.txt.

The Robots Exclusion Protocol is not an official Internet standard, but it’s one that legitimate web crawlers obey.
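For illustration, here is a minimal robots.txt that opts an entire site out of crawling by all compliant bots (the file sits at the root of the domain):

User-agent: *
Disallow: /

A publisher can also name a specific crawler in the User-agent line instead of using the wildcard, as shown for Common Crawl further below.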

Should web publishers be able to use the Robots.txt protocol to prevent large language models from using their website content?

Large Language Models Use Website Content Without Attribution

Some who are involved with search marketing are uncomfortable with how website data is used to train machines without giving anything back, like an acknowledgement or traffic.

Hans Petter Blindheim (LinkedIn profile), Senior Expert at Curamando, shared his opinions with me.

Hans commented:

“When an author writes something after having learned something from an article on your site, they will more often than not link to your original work because it offers credibility and as a professional courtesy.

It’s called a citation.

But the scale at which ChatGPT assimilates content and does not grant anything back differentiates it from both Google and people.

A website is generally created with a business directive in mind.

Google helps people find the content, providing traffic, which has a mutual benefit to it.

But it’s not like large language models asked your permission to use your content, they just use it in a broader sense than what was expected when your content was published.

And if the AI language models do not offer value in return – why should publishers allow them to crawl and use the content?

Does their use of your content meet the standards of fair use?

When ChatGPT and Google’s own ML/AI models trains on your content without permission, spins what it learns there and uses that while keeping people away from your websites – shouldn’t the industry and also lawmakers try to take back control over the Internet by forcing them to transition to an “opt-in” model?”

The concerns that Hans expresses are reasonable.

In light of how fast technology is evolving, should laws concerning fair use be reconsidered and updated?

I asked John Rizvi, a Registered Patent Attorney (LinkedIn profile) who is board certified in Intellectual Property Law, if Internet copyright laws are outdated.

John answered:

“Yes, without a doubt.

One major bone of contention in cases like this is the fact that the law inevitably evolves far more slowly than technology does.

In the 1800s, this maybe didn’t matter so much because advances were relatively slow and so legal machinery was more or less tooled to match.

Today, however, runaway technological advances have far outstripped the ability of the law to keep up.

There are simply too many advances and too many moving parts for the law to keep up.

As it is currently constituted and administered, largely by people who are hardly experts in the areas of technology we’re discussing here, the law is poorly equipped or structured to keep pace with technology…and we must consider that this isn’t an entirely bad thing.

So, in one regard, yes, Intellectual Property law does need to evolve if it even purports, let alone hopes, to keep pace with technological advances.

The primary problem is striking a balance between keeping up with the ways various forms of tech can be used while holding back from blatant overreach or outright censorship for political gain cloaked in benevolent intentions.

The law also has to take care not to legislate against possible uses of tech so broadly as to strangle any potential benefit that may derive from them.

You could easily run afoul of the First Amendment and any number of settled cases that circumscribe how, why, and to what degree intellectual property can be used and by whom.

And attempting to envision every conceivable usage of technology years or decades before the framework exists to make it viable or even possible would be an exceedingly dangerous fool’s errand.

In situations like this, the law really cannot help but be reactive to how technology is used…not necessarily how it was intended.

That’s not likely to change anytime soon, unless we hit a massive and unanticipated tech plateau that allows the law time to catch up to current events.”

So it appears that the issue of copyright law has many considerations to balance when it comes to how AI is trained; there is no simple answer.


OpenAI and Microsoft Sued

An interesting case that was recently filed is one alleging that OpenAI and Microsoft used open source code to create their Copilot product.

The problem with using open source code is that many open source licenses require attribution.

According to an article published in a scholarly journal:

“Plaintiffs allege that OpenAI and GitHub assembled and distributed a commercial product called Copilot to create generative code using publicly accessible code originally made available under various “open source”-style licenses, many of which include an attribution requirement.

As GitHub states, ‘…[t]rained on billions of lines of code, GitHub Copilot turns natural language prompts into coding suggestions across dozens of languages.’

The resulting product allegedly omitted any credit to the original creators.”

The author of that article, who is a legal expert on the subject of copyrights, wrote that many view open source Creative Commons licenses as a “free-for-all.”

Some may also consider the phrase free-for-all a fair description of how datasets comprised of Internet content are scraped and used to generate AI products like ChatGPT.

Background on LLMs and Datasets

Large language models train on multiple datasets of content. Datasets can consist of emails, books, government data, Wikipedia articles, and even datasets built from websites linked from Reddit posts that have at least three upvotes.


Many of the datasets related to the content of the Internet have their origins in the crawl created by a non-profit organization called Common Crawl.

Their dataset, the Common Crawl dataset, is available free for download and use.

The Common Crawl dataset is the starting point for many other datasets that are created from it.

For example, GPT-3 used a filtered version of Common Crawl (Language Models are Few-Shot Learners PDF).

This is how the GPT-3 researchers used the website data contained within the Common Crawl dataset:

“Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset… constituting nearly a trillion words.

This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice.

However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets.

Therefore, we took 3 steps to improve the average quality of our datasets:

(1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora,

(2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and

(3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.”

Google’s C4 dataset (Colossal Clean Crawled Corpus), which was used to create the Text-to-Text Transfer Transformer (T5), also has its roots in the Common Crawl dataset.

Their research paper (Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer PDF) explains:

“Before presenting the results from our large-scale empirical study, we review the necessary background topics required to understand our results, including the Transformer model architecture and the downstream tasks we evaluate on.

We also introduce our approach for treating every problem as a text-to-text task and describe our “Colossal Clean Crawled Corpus” (C4), the Common Crawl-based data set we created as a source of unlabeled text data.

We refer to our model and framework as the ‘Text-to-Text Transfer Transformer’ (T5).”

Google published an article on their AI blog that further explains how Common Crawl data (which contains content scraped from the Internet) was used to create C4.


They wrote:

“An important ingredient for transfer learning is the unlabeled dataset used for pre-training.

To accurately measure the effect of scaling up the amount of pre-training, one needs a dataset that is not only high quality and diverse, but also massive.

Existing pre-training datasets don’t meet all three of these criteria — for example, text from Wikipedia is high quality, but uniform in style and relatively small for our purposes, while the Common Crawl web scrapes are enormous and highly diverse, but fairly low quality.

To satisfy these requirements, we developed the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl that is two orders of magnitude larger than Wikipedia.

Our cleaning process involved deduplication, discarding incomplete sentences, and removing offensive or noisy content.

This filtering led to better results on downstream tasks, while the additional size allowed the model size to increase without overfitting during pre-training.”

Google, OpenAI, even Oracle’s Open Data are using Internet content, your content, to create datasets that are then used to create AI applications like ChatGPT.

Common Crawl Can Be Blocked

It is possible to block Common Crawl and subsequently opt out of all the datasets that are based on Common Crawl.

But if a site has already been crawled, then its data is already in the datasets. There is no way to remove your content from the Common Crawl dataset or from derivative datasets like C4.

Using the Robots.txt protocol will only block future crawls by Common Crawl; it won’t stop researchers from using content already in the dataset.

How to Block Common Crawl From Your Data

Blocking Common Crawl is possible through the use of the Robots.txt protocol, within the limitations discussed above.

The Common Crawl bot is called CCBot.

It is identified by the most up-to-date CCBot user-agent string: CCBot/2.0

Blocking CCBot with Robots.txt is accomplished the same way as with any other bot.

Here is the code for blocking CCBot with Robots.txt.

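# Block Common Crawl's CCBot from the entire site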
User-agent: CCBot
Disallow: /

CCBot crawls from Amazon AWS IP addresses.

CCBot also obeys the nofollow robots meta tag:

<meta name="robots" content="nofollow">

What If You’re Not Blocking Common Crawl?

Web content can be downloaded without permission; that is, after all, how browsers work.

Neither Google nor anybody else needs permission to download and use content that is published publicly.

Website Publishers Have Limited Options

The consideration of whether it is ethical to train AI on web content doesn’t seem to be a part of any conversation about the ethics of how AI technology is developed.

It seems to be taken for granted that Internet content can be downloaded, summarized and transformed into a product called ChatGPT.

Does that seem fair? The answer is complicated.

Featured image by Shutterstock/Krakenimages.com




Google Enhances Bard’s Reasoning Skills

Google’s language model, Bard, is receiving a significant update today that aims to improve its logic and reasoning capabilities.

Jack Krawczyk, the Product Lead for Bard, and Amarnag Subramanya, the Vice President of Engineering for Bard, announced the update in a blog post.

A Leap Forward In Reasoning & Math

These updates aim to improve Bard’s ability to tackle mathematical tasks, answer coding questions, and handle string manipulation prompts.

To achieve this, the developers incorporated “implicit code execution.” This new method allows Bard to detect computational prompts and run code in the background, enabling it to respond more accurately to complex tasks.

“As a result, it can respond more accurately to mathematical tasks, coding questions and string manipulation prompts,” the Google team shared in the announcement.

System 1 and System 2 Thinking: A Blend of Intuition and Logic

The approach used in the update takes inspiration from the well-studied dichotomy in human intelligence, as covered in Daniel Kahneman’s book, “Thinking, Fast and Slow.”

The concept of “System 1” and “System 2” thinking is central to Bard’s improved capabilities.

System 1 is fast, intuitive, and effortless, akin to a jazz musician improvising on the spot.

System 2, however, is slow, deliberate, and effortful, comparable to carrying out long division or learning to play an instrument.


Large Language Models (LLMs), such as Bard, have typically operated under System 1, generating text quickly but without deep thought.

Traditional computation aligns more with System 2, being formulaic and inflexible yet capable of producing impressive results when correctly executed.

“LLMs can be thought of as operating purely under System 1 — producing text quickly but without deep thought,” according to the blog post. However, “with this latest update, we’ve combined the capabilities of both LLMs (System 1) and traditional code (System 2) to help improve accuracy in Bard’s responses.”

A Step Closer To Improved AI Capabilities

The new updates represent a significant step forward in the AI language model field, enhancing Bard’s capabilities to provide more accurate responses.

However, the team acknowledges that there’s still room for improvement:

“Even with these improvements, Bard won’t always get it right… this improved ability to respond with structured, logic-driven capabilities is an important step toward making Bard even more helpful.”

While the improvements are noteworthy, they present potential limitations and challenges.

It’s plausible that Bard may not always generate the correct code or include the executed code in its response.

There could also be scenarios where Bard might not generate code at all. Further, the effectiveness of the “implicit code execution” could depend on the complexity of the task.


In Summary

As Bard integrates more advanced reasoning capabilities, users can look forward to more accurate, helpful, and intuitive AI assistance.

However, all AI technology has limitations and drawbacks.

As with any tool, consider approaching it with a balanced perspective, understanding the capabilities and challenges.


Featured Image: Amir Sajjad/Shutterstock





Microsoft Advertising Boosts Analytics & Global Reach In June Update

Microsoft Advertising details several important updates and expansions in its June product roundup.

The new tools and features aim to enhance website performance analytics, improve cross-device conversion tracking, expand into new global markets, and integrate more seamlessly with other platforms.

Introducing Universal Event Tracking Insights

This month’s standout news is the introduction of Universal Event Tracking (UET) insights, a feature that gives advertisers a deeper understanding of their website’s performance.

The new feature requires no additional coding and will enhance the capabilities of existing UET tags.

“We’re introducing UET insights, a valuable new feature that we’ll add to your existing UET tags with no additional coding required from you. You’ll get a deeper understanding of your website’s performance and also enable Microsoft Advertising to optimize your ad performance more effectively via improved targeting, fraud detection, and reduced conversion loss.”

The new insights tool will roll out automatically starting July 3.

Cross-Device Conversion Attribution Update

Microsoft Advertising is introducing a cross-device attribution model later this month.

This update will enable advertisers to track and connect customers’ conversion journeys across multiple devices and sessions.

Microsoft explains the new feature in a blog article: “For example, if a user clicks on an ad using their laptop but converts on their phone, we’ll now credit that conversion to the initial ad click on the laptop.”


While the update doesn’t introduce new features or settings, advertisers may notice a slight increase in the number of conversions due to improved accuracy.

Expanding to New Markets

In line with its expansion push throughout 2022, Microsoft Advertising announces it’s expanding its advertising reach to 23 new markets.

The new additions include diverse locations ranging from Antigua and Barbuda to Wallis and Futuna.

This expansion allows advertisers to reach their audiences in more parts of the world.

Seamless Integration With Pinterest & Dynamic Remarketing

Microsoft Advertising is releasing Pinterest Import in all markets via the Microsoft Audience Network (MSAN), allowing advertisers to import campaigns from Pinterest Ads.

Further, Dynamic remarketing on MSAN for Autos, Events & Travel is now available in the US, Canada, and the UK.

The remarketing tool enables advertisers to use their feeds to create rich ad experiences on the Microsoft Audience Network and match their target audience with items in their feed where they’ve shown interest.

In Summary

Key takeaways from the June product roundup include the automatic rollout of UET insights starting July 3, a new cross-device attribution model, expansion into 23 new global markets, and enhanced integration with Pinterest via the Microsoft Audience Network.

These developments collectively offer advertisers increased insight into campaign performance, improved accuracy in conversion tracking, and more opportunities to reach audiences worldwide.


Source: Microsoft
Featured Image: PixieMe/Shutterstock





The Hidden Gems of Apple Safari 17: JPEG XL and Font Size Adjustment

Apple’s recently announced Safari 17 brings several key updates that promise to enhance user experience and web page loading times.

Unveiled at the annual Worldwide Developers Conference (WWDC23), two new features of Safari 17 worth paying attention to are JPEG XL support and expanded capabilities of font-size-adjust.

As Safari continues to evolve, these updates highlight the ever-changing landscape of web development and the importance of adaptability.

JPEG XL: A Game Changer For Page Speed Optimization

One of the most noteworthy features of Safari 17 is its support for JPEG XL, a new image format that balances image quality and file size.

JPEG XL allows for the recompression of existing JPEG files without any data loss while significantly reducing their size—by up to 60%.

Page loading speed is a crucial factor that search engines consider when ranking websites. With JPEG XL, publishers can drastically reduce the file size of images on their sites, potentially leading to faster page loads.

Additionally, the support for progressive loading in JPEG XL means users can start viewing images before the entire file is downloaded, improving the user experience on slower connections.

This benefits websites targeting regions with slower internet speeds, enhancing user experience and potentially reducing bounce rates.
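Browser support for JPEG XL is still limited, so a site experimenting with the format would typically serve it alongside a fallback. A minimal sketch using the picture element, with hypothetical hero.jxl and hero.jpg files:

<picture>
  <source srcset="hero.jxl" type="image/jxl">
  <img src="hero.jpg" alt="Hero image" width="1200" height="800">
</picture>

Browsers that recognize the image/jxl type download the smaller JPEG XL file, while everything else falls back to the ordinary JPEG.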


Font Size Adjust: Improving User Experience & Consistency

Safari 17 expands the capabilities of font-size-adjust, a CSS property that ensures the visual size of different fonts remains consistent across all possible combinations of fallback fonts.

By allowing developers to pull the sizing metric from the main font and apply it to all fonts, the from-font value can help websites maintain a consistent visual aesthetic, which is critical for user experience.

Meanwhile, the two-value syntax provides more flexibility in adjusting different font metrics, supporting a broader range of languages and design choices.
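As a rough sketch of both forms (the font names below are placeholders, not a recommendation):

body {
  font-family: "Custom Web Font", Arial, sans-serif;
  /* Pull the sizing metric from the first available font so fallbacks render at the same visual size */
  font-size-adjust: from-font;
}

h1 {
  /* Two-value syntax: normalize a chosen metric, here cap height, to 0.5 of the font size */
  font-size-adjust: cap-height 0.5;
}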

Websites with consistent and clear text display, irrespective of the font in use, will likely provide a better user experience. A better experience could lead to longer visits and higher engagement.

Reimagining SEO Strategies With Safari 17

Given these developments, SEO professionals may need to adjust their strategies to leverage the capabilities of Safari 17 fully.

This could involve:

  • Image Optimization: With support for JPEG XL, SEO professionals might need to consider reformatting their website images to this new format.
  • Website Design: The expanded capabilities of font-size-adjust could require rethinking design strategies. Consistent font sizes across different languages and devices can improve CLS, one of Google’s core web vitals.
  • Performance Tracking: SEO professionals will need to closely monitor the impact of these changes on website performance metrics once the new version of Safari rolls out.

In Summary

Apple’s Safari 17 brings new features that provide opportunities to improve several website performance factors crucial for SEO.

Detailed documentation on these Safari 17 updates is available on the official WebKit blog for those interested in delving deeper into these features.


Featured Image: PixieMe/Shutterstock


