Is ChatGPT Use Of Web Content Fair?

Large Language Models (LLMs) like ChatGPT train on multiple sources of information, including web content. That data becomes the basis for summaries and articles produced without attribution or benefit to those who published the original content used to train ChatGPT.

Search engines download website content (called crawling and indexing) to provide answers in the form of links to the websites.

Website publishers have the ability to opt-out of having their content crawled and indexed by search engines through the Robots Exclusion Protocol, commonly referred to as Robots.txt.

The Robots Exclusion Protocol is not an official Internet standard, but it’s one that legitimate web crawlers obey.

Should web publishers be able to use the Robots.txt protocol to prevent large language models from using their website content?

Large Language Models Use Website Content Without Attribution

Some who are involved with search marketing are uncomfortable with how website data is used to train machines without giving anything back, like an acknowledgement or traffic.

Hans Petter Blindheim, Senior Expert at Curamando, shared his opinions with me.

Hans commented:

“When an author writes something after having learned something from an article on your site, they will more often than not link to your original work because it offers credibility and as a professional courtesy.

It’s called a citation.

But the scale at which ChatGPT assimilates content and does not grant anything back differentiates it from both Google and people.

A website is generally created with a business directive in mind.

Google helps people find the content, providing traffic, which has a mutual benefit to it.

But it’s not like large language models asked your permission to use your content, they just use it in a broader sense than what was expected when your content was published.

And if the AI language models do not offer value in return – why should publishers allow them to crawl and use the content?

Does their use of your content meet the standards of fair use?

“When ChatGPT and Google’s own ML/AI models train on your content without permission, spin what they learn there, and use that while keeping people away from your websites – shouldn’t the industry, and also lawmakers, try to take back control over the Internet by forcing them to transition to an “opt-in” model?”

The concerns that Hans expresses are reasonable.

In light of how fast technology is evolving, should laws concerning fair use be reconsidered and updated?

I asked John Rizvi, a Registered Patent Attorney who is board certified in Intellectual Property Law, whether Internet copyright laws are outdated.

John answered:

“Yes, without a doubt.

One major bone of contention in cases like this is the fact that the law inevitably evolves far more slowly than technology does.

In the 1800s, this maybe didn’t matter so much because advances were relatively slow and so legal machinery was more or less tooled to match.

Today, however, runaway technological advances have far outstripped the ability of the law to keep up.

There are simply too many advances and too many moving parts for the law to keep up.

As it is currently constituted and administered, largely by people who are hardly experts in the areas of technology we’re discussing here, the law is poorly equipped or structured to keep pace with technology…and we must consider that this isn’t an entirely bad thing.

So, in one regard, yes, Intellectual Property law does need to evolve if it even purports, let alone hopes, to keep pace with technological advances.

The primary problem is striking a balance between keeping up with the ways various forms of tech can be used while holding back from blatant overreach or outright censorship for political gain cloaked in benevolent intentions.

The law also has to take care not to legislate against possible uses of tech so broadly as to strangle any potential benefit that may derive from them.

You could easily run afoul of the First Amendment and any number of settled cases that circumscribe how, why, and to what degree intellectual property can be used and by whom.

And attempting to envision every conceivable usage of technology years or decades before the framework exists to make it viable or even possible would be an exceedingly dangerous fool’s errand.

In situations like this, the law really cannot help but be reactive to how technology is used…not necessarily how it was intended.

That’s not likely to change anytime soon, unless we hit a massive and unanticipated tech plateau that allows the law time to catch up to current events.”

So it appears there are many considerations to balance in how copyright law applies to AI training; there is no simple answer.

OpenAI and Microsoft Sued

An interesting case that was recently filed is one alleging that OpenAI and Microsoft used open source code to create their Copilot product.

The problem with using that open source code is that many open source licenses require attribution.

According to an article published in a scholarly journal:

“Plaintiffs allege that OpenAI and GitHub assembled and distributed a commercial product called Copilot to create generative code using publicly accessible code originally made available under various “open source”-style licenses, many of which include an attribution requirement.

As GitHub states, ‘…[t]rained on billions of lines of code, GitHub Copilot turns natural language prompts into coding suggestions across dozens of languages.’

The resulting product allegedly omitted any credit to the original creators.”

The author of that article, who is a legal expert on the subject of copyrights, wrote that many view open source Creative Commons licenses as a “free-for-all.”

Some may also consider free-for-all a fair description of how datasets comprised of Internet content are scraped and used to generate AI products like ChatGPT.

Background on LLMs and Datasets

Large language models train on multiple datasets of content. Datasets can consist of emails, books, government data, Wikipedia articles, and even datasets created from websites linked from Reddit posts with at least three upvotes.

Many of the datasets related to the content of the Internet have their origins in the crawl created by a non-profit organization called Common Crawl.

Their dataset, the Common Crawl dataset, is available free for download and use.

The Common Crawl dataset is the starting point for many other datasets that are created from it.

For example, GPT-3 used a filtered version of Common Crawl (Language Models are Few-Shot Learners PDF).

This is how GPT-3 researchers used the website data contained within the Common Crawl dataset:

“Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset… constituting nearly a trillion words.

This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice.

However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets.

Therefore, we took 3 steps to improve the average quality of our datasets:

(1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora,

(2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and

(3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.”
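The GPT-3 paper does not publish its deduplication code, but the document-level fuzzy deduplication it describes can be illustrated with a small sketch. The shingle size, similarity threshold, and function names below are illustrative assumptions; production systems typically use scalable approximations such as MinHash rather than exact Jaccard comparisons.

```python
def shingles(text, n=3):
    """Fingerprint a document as its set of word n-grams."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def fuzzy_dedup(docs, threshold=0.8):
    """Keep each document only if it is not a near-duplicate of one already kept."""
    kept, fingerprints = [], []
    for doc in docs:
        fp = shingles(doc)
        if all(jaccard(fp, seen) < threshold for seen in fingerprints):
            kept.append(doc)
            fingerprints.append(fp)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",  # near-duplicate
    "large language models train on web scale datasets",
]
print(len(fuzzy_dedup(docs)))  # 2: the near-duplicate is dropped
```

The same idea, applied within and across datasets, is what keeps a held-out validation set from leaking into the training data.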

Google’s C4 dataset (Colossal Clean Crawled Corpus), which was used to create the Text-to-Text Transfer Transformer (T5), has its roots in the Common Crawl dataset, too.

Their research paper (Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer PDF) explains:

“Before presenting the results from our large-scale empirical study, we review the necessary background topics required to understand our results, including the Transformer model architecture and the downstream tasks we evaluate on.

We also introduce our approach for treating every problem as a text-to-text task and describe our “Colossal Clean Crawled Corpus” (C4), the Common Crawl-based data set we created as a source of unlabeled text data.

We refer to our model and framework as the ‘Text-to-Text Transfer Transformer’ (T5).”

Google published an article on their AI blog that further explains how Common Crawl data (which contains content scraped from the Internet) was used to create C4.

They wrote:

“An important ingredient for transfer learning is the unlabeled dataset used for pre-training.

To accurately measure the effect of scaling up the amount of pre-training, one needs a dataset that is not only high quality and diverse, but also massive.

Existing pre-training datasets don’t meet all three of these criteria — for example, text from Wikipedia is high quality, but uniform in style and relatively small for our purposes, while the Common Crawl web scrapes are enormous and highly diverse, but fairly low quality.

To satisfy these requirements, we developed the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl that is two orders of magnitude larger than Wikipedia.

Our cleaning process involved deduplication, discarding incomplete sentences, and removing offensive or noisy content.

This filtering led to better results on downstream tasks, while the additional size allowed the model size to increase without overfitting during pre-training.”
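The C4 paper describes line-level cleaning heuristics such as keeping only lines that end in terminal punctuation and dropping short or boilerplate lines. Here is a rough sketch of that kind of filter; the exact rules and thresholds below are simplified assumptions, not Google’s actual pipeline.

```python
TERMINAL_PUNCTUATION = (".", "!", "?", '"', "'")

def clean_page(text, min_words=5):
    """C4-style line filtering: keep lines that look like complete sentences."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if len(line.split()) < min_words:            # drop short/navigation lines
            continue
        if not line.endswith(TERMINAL_PUNCTUATION):  # drop incomplete sentences
            continue
        if "javascript" in line.lower():             # drop browser-warning boilerplate
            continue
        kept.append(line)
    return "\n".join(kept)

page = "\n".join([
    "Click to comment",
    "This is a complete sentence with enough words to keep.",
    "Please enable javascript to view the comments on this page.",
    "An unfinished line with many words but no period",
])
print(clean_page(page))  # only the complete sentence survives
```

Simple rules like these, applied at web scale, are what turn raw Common Crawl scrapes into a usable pre-training corpus.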

Google, OpenAI, and even Oracle’s Open Data are using Internet content – your content – to create datasets that are then used to create AI applications like ChatGPT.

Common Crawl Can Be Blocked

It is possible to block Common Crawl and subsequently opt-out of all the datasets that are based on Common Crawl.

But if the site has already been crawled, then the website data is already in datasets. There is no way to remove your content from the Common Crawl dataset or any of the derivative datasets like C4.

Using the Robots.txt protocol will only block future crawls by Common Crawl; it won’t stop researchers from using content already in the dataset.

How to Block Common Crawl From Your Data

Blocking Common Crawl is possible through the use of the Robots.txt protocol, within the above discussed limitations.

The Common Crawl bot is called CCBot.

It is identified by the most up-to-date CCBot user-agent string: CCBot/2.0

Blocking CCBot with Robots.txt is accomplished the same as with any other bot.

Here is the code for blocking CCBot with Robots.txt.

User-agent: CCBot
Disallow: /

CCBot crawls from Amazon AWS IP addresses.

CCBot also follows the nofollow Robots meta tag:

<meta name="robots" content="nofollow">
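You can verify that a robots.txt file actually blocks CCBot before deploying it, using Python’s standard library. This is a minimal sketch with `urllib.robotparser`; the example URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

def is_ccbot_blocked(robots_txt, url="https://example.com/any-page"):
    """Return True if the given robots.txt text disallows CCBot for the URL."""
    parser = RobotFileParser()
    parser.modified()  # mark rules as loaded so can_fetch() consults them
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("CCBot", url)

rules = "User-agent: CCBot\nDisallow: /\n"
print(is_ccbot_blocked(rules))  # True: CCBot may not fetch any page
```

The same check works for any other crawler by swapping in its user-agent token.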

What If You’re Not Blocking Common Crawl?

Web content can be downloaded without permission. That is how browsers work: they download content.

Neither Google nor anybody else needs permission to download and use content that is published publicly.

Website Publishers Have Limited Options

Whether it is ethical to train AI on web content doesn’t seem to be part of any conversation about the ethics of how AI technology is developed.

It seems to be taken for granted that Internet content can be downloaded, summarized and transformed into a product called ChatGPT.

Does that seem fair? The answer is complicated.

Featured image by Shutterstock/Krakenimages.com




TikTok CEO To Testify In Hearing On Data Privacy And Online Harm Reduction

TikTok CEO Shou Chew will testify in a hearing before the U.S. House Committee on Energy and Commerce this Thursday, March 23, at 10:00 a.m. ET.

As CEO, Chew is responsible for TikTok’s business operations and strategic decisions.

The “TikTok: How Congress Can Safeguard American Data Privacy and Protect Children from Online Harms” hearing will be streamed live on the Energy and Commerce Committee’s website.

According to written testimony submitted by Chew, the hearing will focus on TikTok’s alleged commitment to transparency, teen safety, consumer privacy, and data security.

It also appears to broach the topic of misconceptions about the platform, such as its connection to the Chinese government through its parent company, ByteDance.

Chew shared a special message on TikTok yesterday from Washington, D.C., thanking 150 million users, five million businesses, and 7,000 employees in the U.S. for helping build the TikTok community.


The video has received over 85k comments from users, many describing how TikTok has allowed them to interact with people worldwide and find unbiased news, new perspectives, educational content, inspiration, and joy.

TikTok Updates Guidelines And Offers More Educational Content

TikTok has been making significant changes to its platform to address many of these concerns ahead of this hearing, hoping to evade a total U.S. ban on the platform.

Below is an overview of some efforts by TikTok to rehab its perception before the hearing.

Updated Community Guidelines – TikTok updated community guidelines and shared its Community Principles to demonstrate commitment to keeping the platform safe and inclusive for all users.

For You Feed Refresh – TikTok recommends content to users based on their engagement with content and creators. For users who feel that recommendations no longer align with their interests, TikTok introduced the ability to refresh the For You Page, allowing them to receive fresh recommendations as if they started a new account.

STEM Feed – To improve the quality of educational content on TikTok, it will introduce a STEM feed for content focused on Science, Technology, Engineering, and Mathematics. Unlike the content that appears when users search the #STEM hashtag, TikTok says that Common Sense Networks and Poynter will review STEM feed content to ensure it is safe for younger audiences and factually accurate.

This could make it more like the version of TikTok in China – Douyin – that promotes educational content to younger audiences over entertaining content.

Series Monetization – To encourage creators to create in-depth, informative content, TikTok introduced a new monetization program for Series content. Series allows creators to earn income by putting up to 80 videos, each up to 20 minutes long, behind a paywall.

More Congressional Efforts To Restrict TikTok

The TikTok hearing tomorrow isn’t the only Congressional effort to limit or ban technologies like TikTok.

Earlier this month, Sen. Mark Warner (D-VA) introduced the RESTRICT Act (Restricting the Emergence of Security Threats that Risk Information and Communications Technology), which would create a formal process for the government to review and mitigate risks of technology originating in countries like China, Cuba, Iran, North Korea, Russia, and Venezuela.

Organizations like the Tech Oversight Project have pointed out that Congress should look beyond TikTok and investigate similar risks to national security and younger audiences posed by other Big Tech platforms like Amazon, Apple, Google, and Meta.

We will follow tomorrow’s hearing closely – be sure to come back for our coverage to determine how this will affect users and predict what will happen next.


Featured Image: Alex Verrone/Shutterstock




How Is It Different From GPT-3.5?

GPT-4, the latest version of OpenAI’s language model that powers ChatGPT, is a breakthrough in artificial intelligence (AI) technology that has revolutionized how we communicate with machines.

GPT-4’s multimodal capabilities enable it to process both text and images, making it an incredibly versatile tool for marketers, businesses, and individuals alike.

What Is GPT-4?

GPT-4 is reported to be far more advanced than its predecessor, GPT-3.5. This enhancement enables the model to better understand context and distinguish nuances, resulting in more accurate and coherent responses.

Furthermore, GPT-4 has a maximum token limit of 32,000 (equivalent to 25,000 words), which is a significant increase from GPT-3.5’s 4,000 tokens (equivalent to 3,125 words).

“We spent 6 months making GPT-4 safer and more aligned. GPT-4 is 82% less likely to respond to requests for disallowed content and 40% more likely to produce factual responses than GPT-3.5 on our internal evaluations.” – OpenAI

GPT-3.5 Vs. GPT-4 – What’s Different?

GPT-4 offers several improvements over its predecessor, some of which include:

1. Linguistic Finesse

While GPT-3.5 is quite capable of generating human-like text, GPT-4 has an even greater ability to understand and generate different dialects and respond to emotions expressed in the text.

For example, GPT-4 can recognize and respond sensitively to a user expressing sadness or frustration, making the interaction feel more personal and genuine.

Screenshot from OpenAI, March 2023

One of the most impressive aspects of GPT-4 is its ability to work with dialects, which are regional or cultural variations of a language.

Dialects can be extremely difficult for language models to understand, as they often have unique vocabulary, grammar, and pronunciation that may not be present in the standard language.

However, GPT-4 has been specifically designed to overcome these challenges and can accurately generate and interpret text in various dialects.

2. Information Synthesis

GPT-4 can answer complex questions by synthesizing information from multiple sources, whereas GPT-3.5 may struggle to connect the dots.

For example, when asked about the link between the decline of bee populations and the impact on global agriculture, GPT-4 can provide a more comprehensive and nuanced answer, citing different studies and sources.

Screenshot from OpenAI, March 2023

Unlike its predecessor, GPT-4 can cite the sources it draws on when generating text.

This means that when the model generates content, it cites the sources it has used, making it easier for readers to verify the accuracy of the information presented.

3. Creativity And Coherence

While GPT-3.5 can generate creative content, GPT-4 goes a step further by producing stories, poems, or essays with improved coherence and creativity.

For example, GPT-4 can produce a short story with a well-developed plot and character development, whereas GPT-3.5 might struggle to maintain consistency and coherence in the narrative.

Screenshots from OpenAI, March 2023

4. Complex Problem-Solving

GPT-4 demonstrates a strong ability to solve complex mathematical and scientific problems beyond the capabilities of GPT-3.5.

For example, GPT-4 can solve advanced calculus problems or simulate chemical reactions more effectively than its predecessor.

Screenshots from OpenAI, March 2023

GPT-4 has significantly improved its ability to understand and process complex mathematical and scientific concepts. Its mathematical skills include the ability to solve complex equations and perform various mathematical operations such as calculus, algebra, and geometry.

In addition, GPT-4 is also capable of handling scientific subjects such as physics, chemistry, biology, and astronomy.

Its advanced processing power and language modeling capabilities allow it to analyze complex scientific texts and provide insights and explanations easily.

As the technology continues to evolve, it is likely that GPT-4 will continue to expand its capabilities and become even more adept at a wider range of subjects and tasks.

5. Programming Power

GPT-4’s programming capabilities have taken social media by storm with its ability to generate code snippets or debug existing code more efficiently than GPT-3.5, making it a valuable resource for software developers.

With the help of GPT-4, weeks of work can be condensed into a few short hours, allowing extraordinary results to be achieved in record time. You can test these prompts:

  • “Write code to train X with dataset Y.”
  • “I’m getting this error. Fix it.”
  • “Now improve the performance.”
  • “Now wrap it in a GUI.”
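For automation, prompts like those above can be sent to the model over OpenAI’s chat completions HTTP API. The sketch below only builds the JSON request body; the endpoint, model name, and message format follow OpenAI’s public API documentation at the time of writing, so verify them against the current docs before relying on this.

```python
import json

API_URL = "https://api.openai.com/v1/chat/completions"  # per OpenAI's API docs

def build_debug_request(code, error, model="gpt-4"):
    """Build the JSON body asking GPT-4 to fix a snippet that raised an error."""
    prompt = f"I'm getting this error:\n{error}\n\nFix it:\n{code}"
    body = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful coding assistant."},
            {"role": "user", "content": prompt},
        ],
    }
    return json.dumps(body)

request_body = build_debug_request("print(1/0)", "ZeroDivisionError: division by zero")
```

The resulting string would then be POSTed to `API_URL` with an `Authorization: Bearer <your API key>` header.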

6. Image And Graphics Understanding

Unlike GPT-3.5, which focuses primarily on text, GPT-4 can analyze and comment on images and graphics.

For example, GPT-4 can describe the content of a photo, identify trends in a graph, or even generate captions for images, making it a powerful tool for education and content creation.

Screenshot from OpenAI, March 2023

Imagine this technology integrated with Google Analytics or Matomo. You could get highly accurate analytics for all your dashboards in a few minutes.

7. Reduction Of Inappropriate Or Biased Responses

GPT-4 implements mechanisms to minimize undesirable results, thereby increasing reliability and ethical responsibility.

For example, GPT-4 is less likely to generate politically biased, offensive, or harmful content, making it a more trustworthy AI companion than GPT-3.5.

Where Can ChatGPT Go Next?

Despite its remarkable advancements, ChatGPT still has room for improvement:

  • Addressing neutrality: Enhancing its ability to discern the context and respond accordingly.
  • Understanding the user: Developing the capacity to understand who is communicating (who, where, and how).
  • External integrations: Expanding its reach through web, API, and robotic integrations.
  • Long-term memory: Improving its ability to recall past interactions and apply that knowledge to future conversations.
  • Reducing hallucination: Minimizing instances where the AI confidently asserts false information.

As ChatGPT continues to evolve, it is poised to revolutionize marketing and AI-driven communications.

Its potential applications in content creation, education, customer service, and more are vast, making it an essential tool for businesses and individuals in the digital age.

Featured Image: LALAKA/Shutterstock




Should Congress Investigate Big Tech Platforms?

This week, the House Energy and Commerce Committee will hold a full committee hearing with TikTok CEO Shou Chew to discuss how the platform handles users’ data, its effect on kids, and its relationship with ByteDance, its Chinese parent company.

This hearing is part of an ongoing investigation to determine whether TikTok should be banned in the United States or forced to split from ByteDance.

A ban on TikTok would affect over 150 million Americans who use TikTok for education, entertainment, and income generation.

It would also affect the five million U.S. businesses using TikTok to reach customers.

Is TikTok The Only Risk To National Security?

According to a memo released by the Tech Oversight Project, TikTok is not the only tech platform that poses risks to national security, mental health, and children.

As Congress scrutinizes TikTok, the Tech Oversight Project also strongly urges an investigation of risks posed by tech companies like Amazon, Apple, Meta, and Google.

These platforms have a documented history of serving content harmful to younger audiences and adversarial to U.S. interests. They have also failed on many occasions to protect users’ private data.

Many Big Tech companies have seen TikTok’s success and tried to emulate some of its features to encourage users to spend as much time within their platforms’ ecosystems as possible. Academics, activists, non-governmental organizations, and others have long raised concerns about these platforms’ risks.

To truly reduce Big Tech’s risks to our society, Congress must look beyond TikTok and hold other companies accountable for the same dangers they pose to national security, mental health, and private data.

Risks Posed By Big Tech Companies

The following are examples of the risks Big Tech companies pose to U.S. users.

Amazon

Amazon has made several controversial moves, including a partnership with a state propaganda agency to launch a China books portal and offering AWS services to Chinese companies, including a banned surveillance firm with ties to the military.

Apple

Independent research found that Apple collects detailed information about its users, even when users choose not to allow tracking by apps from the App Store. Over half of the top 200 suppliers for Apple operate factories in China.

Google

The FTC fined Google and YouTube $170 million for collecting children’s data without parental consent. YouTube also changed its algorithm to make it more addictive, increasing users’ time watching videos and consuming ads.

Meta

Facebook allowed Cambridge Analytica to harvest the private data of over 50 million users. It also failed to notify over 530 million users of a data breach that resulted in users’ private data being stolen.

It also allowed Russian interference in the 2016 elections. The influence operation posed as an independent news organization with 13 accounts and two pages, pushing messages critical of right-wing voices and the center-left.

TikTok 

TikTok employees confirmed that its Chinese parent company, ByteDance, is involved in decision-making and has access to TikTok’s user data. While testifying before the Senate Homeland Security Committee, Vanessa Pappas, TikTok COO, would not confirm whether ByteDance would give TikTok user data to the Chinese government.

Conclusion

While the dangers posed by TikTok are undeniable, it’s clear that Congress should also address the risks posed throughout the tech industry. By holding all major offenders accountable, we can create a safe, secure, and responsible digital landscape for everyone.


Featured Image: Koshiro K/Shutterstock


