Two pages with the same title and/or meta-description within one domain

Is there any penalty on Google rankings for using two pages with the same title and/or meta-description? If so, what is the penalty?
Both pages are on the same domain. One page URL is example.com/abcd and the other page URL is example.com/uvwxyz. The H1 header for both pages is the same, and both have the same meta-description.

I don't think Google would punish this.
Think of YouTube (which is owned by Google): The content of the title element follows this schema: [user-contributed video title] - YouTube. The meta-description consists of the user-contributed video description.
Now, there are probably thousands of videos with the very same title ("My cute cat") and some of them could even have the same description ("See my cute cat").
However, if a website consists of many (or even only) pages with the same title and meta-description, it gambles away the possibility of a better ranking. But as long as all these pages really have different content, it won't be punished.

The title and meta description are among the signals search engines use to identify the topic of a page and rank it in search results. The title carries a lot of weight in search rankings, and both the title and description are displayed in search results along with the URL.
Since, as you mentioned, the content of the two pages is different, by
having a duplicate title/description you are losing an opportunity
to target different keywords for search rankings.
Having the same title/description also makes it difficult for both users and search
engines to identify and differentiate between the pages.
So even though there is no negative influence, you are losing out on an important signal (the title) which can help improve your search ranking.
Some reference reading on titles: http://www.searchenabler.com/blog/title-tag-optimization-tips-for-seo/
and on duplicate content: http://www.searchenabler.com/blog/learn-seo-duplicate-content/

There is no punishment per se; it just isn't best practice. Why do you have duplicate meta information? Is the information the same on each page? Does it need to be?


How do I target multiple XML elements with the same name and without any id?

I am trying to scrape a website for the financials of Indian companies as a side project and put the data into Google Sheets using XPath.
Link: https://ticker.finology.in/company/AFFLE
I am able to extract data from elements that have a specific id, like cash, net debt, etc.; however, I am stuck extracting data for labels like Sales Growth.
What I tried: copying the full XPath from the console, //*[@id="mainContent_updAddRatios"]/div[13]/p/span. This works; however, I am relying on the index of the div (13), which may change for different companies, so I am unable to automate it.
Please assist with a scalable solution.
PS: I am a Product Manager with basic coding expertise, as I was a developer a few years ago.
At some point you need to "hardcode" something unless you have some other means of mapping the content of the page to your spreadsheet. In your example you appear to be targeting "Sales Growth" percentage. If you are not comfortable hardcoding the index of the div (13), you could identify it by the id of the "Sales Growth" label which is mainContent_lblSalesGrowthorCasa.
For example, change your
//*[@id="mainContent_updAddRatios"]/div[13]/p/span
to:
//*[@id = "mainContent_updAddRatios"]/div[.//span/@id = "mainContent_lblSalesGrowthorCasa"]/p/span
which selects the div based on it containing a span with id="mainContent_lblSalesGrowthorCasa". Ultimately, whether you "hardcode" the exact index of the div or "hardcode" the ids of the nodes, you are still embedding assumptions regarding the structure of the page.
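The same id-anchored selection can be reproduced in a script outside Google Sheets. Here is a minimal Python sketch using only the standard library; the HTML fragment is invented to mirror the structure described above, so the real page may well differ:

```python
import xml.etree.ElementTree as ET

# Invented fragment mirroring the described structure (label span + value span).
SAMPLE = """
<div id="mainContent_updAddRatios">
  <div><p><span id="mainContent_lblCash">Cash</span> <span>100</span></p></div>
  <div><p><span id="mainContent_lblSalesGrowthorCasa">Sales Growth</span> <span>12.5%</span></p></div>
</div>
"""

root = ET.fromstring(SAMPLE)
value = None
# Pick the row by the label's id instead of by its positional index.
for row in root.findall("./div"):
    if row.find(".//span[@id='mainContent_lblSalesGrowthorCasa']") is not None:
        value = row.findall(".//p/span")[1].text  # second span holds the figure
        break
```

The id-based lookup survives rows being reordered or inserted, which a hardcoded `div[13]` does not.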
Thanks @david, that helped.
Two questions
What if the structure of the page changes? For example, if the website decided to remove the p tag, would my sheet fail? How do we avoid failure in such cases?
Also, since every id is unique, the probability of an id changing is lower than that of the index changing. Correct me if I am wrong?
What do we do when the elements don't have an id, like Profit Growth, RoE, RoCE, etc.?

Search engine for a blog website (searching inside links)

I created a very basic search option for my blog, and it generates results based on topics and keywords. What I am looking for: in certain articles I have added links to external websites (for example, when I refer to someone else's blog for more information). Can my search also go through those links and find results there? Is that possible? And I don't want to go for GCSE.
Thanks in advance. It will be of great help.
Yes, it is possible to write a bot to crawl external websites from links. I've made one that crawled 100K+ website URLs, so it is certainly possible to build one that crawls the links from your blog.
To create a search engine, you'll need to know some internals regarding how they work...
Search Bots work like this:
Crawler fetches pages. This step is pretty easy, as it uses curl.
Parser splits the HTML into pieces so that data can be extracted from the page. It has two sub-components:
a. Extracts any data from the page that you want to capture & then saves that data into a database.
b. Extracts links and places them back into the crawling queue. This creates an infinite loop, so your bot never stops crawling (unless someone else's malformed URL crashes it, which happens a lot, so be ready to fix it frequently).
Indexer creates lookup indexes, which map keywords to each web page's contents. It has two sub-components:
a. Creates a Forward Index, which maps each document to keywords that are inside of that document.
doc1 | bird, aviary, robin, dove, blue jay, cardinal
doc2 | birds, bird watching, binoculars
doc3 | cats, eat, birds
doc4 | cats, generally, don't, like, water, nor, neighborhood, dogs
doc5 | dog, shows, look, fun
b. Creates an Inverted Index from the Forward Index, which reverses the indices. This allows users to search by keyword: the search script looks up and suggests which documents users may want to view. Like so...
bird | doc1, doc2
cat | doc3, doc4
dog | doc4, doc5
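The bot loop described above (fetch, extract, re-queue) can be sketched in a few lines of Python. The `fetch` callable and the naive regex link extraction are stand-ins; a real bot would use an HTTP client, a proper HTML parser, and robots.txt handling:

```python
from collections import deque
import re

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl: fetch a page, harvest its links, re-queue them."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            body = fetch(url)  # e.g. a urllib/curl wrapper in a real bot
        except Exception:
            continue           # malformed URLs crash crawlers; skip and move on
        pages[url] = body
        for link in re.findall(r'href="([^"]+)"', body):
            if link not in seen:   # the queue feeds itself: the loop only
                seen.add(link)     # stops when no new links appear
                queue.append(link)
    return pages

# Tiny in-memory "web" standing in for real HTTP fetches.
fake_web = {"a": '<a href="b">next</a>', "b": '<a href="a">back</a>'}
pages = crawl(["a"], fake_web.__getitem__)
```

The `seen` set is what keeps the self-feeding queue from revisiting pages forever.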
Search Forms work like this:
Search Form shows the HTML input box to the user.
Search Script will search the Inverted Index to find which document links to display in the Search Engine Results Page.
Search Engine Results Page (yes, SERP is an actual industry acronym for Search Engine Results Page). This displays the list of search result links. You can style it any way you'd like; it doesn't have to look like Google's, Microsoft's Bing, or Yahoo's engines.
Examples:
Searching for:
"bird" returns links to "doc1, doc2"
"cat" returns links to "doc3, doc4"
"dog" returns links to "doc4, doc5"
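Putting the toy forward index and the lookup step together in Python is a short exercise. Note this literal version matches exact keywords only; a real engine would also normalize and stem terms so that "cat"/"cats" and "bird"/"birds" collapse together, as the example results above assume:

```python
from collections import defaultdict

# Forward index from the example: document -> keywords.
forward_index = {
    "doc1": ["bird", "aviary", "robin", "dove", "blue jay", "cardinal"],
    "doc2": ["birds", "bird watching", "binoculars"],
    "doc3": ["cats", "eat", "birds"],
    "doc4": ["cats", "generally", "don't", "like", "water", "nor", "neighborhood", "dogs"],
    "doc5": ["dog", "shows", "look", "fun"],
}

# Invert it: keyword -> set of documents containing that keyword.
inverted_index = defaultdict(set)
for doc, keywords in forward_index.items():
    for kw in keywords:
        inverted_index[kw].add(doc)

def search(term):
    """The search script's lookup: keyword -> sorted list of matching documents."""
    return sorted(inverted_index.get(term, set()))
```

For example, `search("cats")` returns `["doc3", "doc4"]`; without stemming, `search("dog")` finds only `["doc5"]` because doc4 contains "dogs".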
Good luck building your search engine for your blog!

Notice "Too many URLs"

I get this notice in my Google Analytics panel:
Too many URLs
The number of unique URLs for the view "All site data" exceeds the daily limit. Excess data is displayed in the summary "(other)" row in the reports.
Google Analytics summarizes data when a table has too many rows in one day. When you send too many unique URLs, the surplus values are summarized in a single "(other)" row in the reports. This severely limits your ability to perform detailed analysis by URL.
Too many URLs are usually the result of combinations of unique URL parameters. To avoid exceeding the limit, the typical solution is to exclude irrelevant parameters from URLs. For example, open Admin > View Settings and use the "Exclude URL Query Parameters" setting to exclude parameters such as sessionid. If your site supports site search, use the Site Search Settings to track search-related parameters while removing them from URLs.
How will this affect my website?
I don't understand why I get this notice or how to fix it.
I checked what Google suggests:
Admin > View Settings, the "Exclude URL Query Parameters" setting, to exclude parameters such as sessionid.
Can anyone explain how to use this to fix the problem?
Thanks.
It does not affect your website; it affects the GA reports only.
The URL for any pageview is stored in the "page" dimension. Google Analytics can display at most 50,000 distinct values for this dimension for the selected timeframe. In your case there are more than 50,000 values, so any excess pages are grouped together in a row labeled "(other)".
Now, it may be that you have more than 50,000 distinct URLs on your website, but Google thinks this is unlikely, so it suggests checking whether you are "artificially" inflating the number of distinct values for the page dimension.
A bad but simple example: imagine you allow your users to choose their own background color for your site, and that the choice of color is transmitted in a query parameter. So you might have e.g.
index.php?bgcolor=#cc0000
index.php?bgcolor=#ee5500
index.php?bgcolor=#000000
....
Due to the query parameter, these URLs would show up as three different pages, i.e. three different rows in the Google Analytics reports, despite the fact that in all cases the same content is displayed (albeit with different background colors).
The proper strategy in this case would be to go to the admin section, view settings, and insert the bgcolor parameter in the "Exclude URL Query Parameters" box (along with any other parameters that do not change the content that is displayed). The parameter will then be stripped from the URL before the data is processed, and the pageviews will be aggregated into a single row with index.php as the single value for the page dimension (of course you have to insert your own query parameter names).
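The effect of that setting can be illustrated with a small Python sketch that normalizes URLs the same way; the ignore-list here is an assumption for the example:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed list of parameters that do not change the displayed content.
IGNORED = {"bgcolor", "sessionid"}

def strip_ignored_params(url):
    """Drop ignored query parameters, mimicking GA's exclusion setting."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in IGNORED]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, `strip_ignored_params("/index.php?bgcolor=cc0000&page=2")` yields `/index.php?page=2`, so all the bgcolor variants collapse into a single page dimension value.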

How can I smartly extract information from an HTML page?

I am building something that can more or less extract key information from an arbitrary website. For example, if I crawled a McDonalds page and wanted to figure out programmatically the opening and closing times of McDonalds, what is an intelligent way to do it?
In a general case, maybe I also want to find out whether McDonalds sells chicken wings, or the address of McDonalds.
What I am thinking is that I will have a specific case for time, wings, and address and have code that is unique for each of those 3 cases.
But I am not sure how to approach this. I have the sites crawled, and the HTML and related information parsed into JSON already. My current approach is something like finding the title tag and checking whether it contains keywords like address or location. If the title contains those keywords, I look through the current page and identify chunks of content that resemble an address, such as content that mentions cities or countries, or that contains the words St or Street.
I am wondering if there is a better approach to looking for key data; I'm hoping for a nicer starting point, or to bounce around some ideas. Pointers to good articles on this would be great as well.
Let me know if this is unclear.
Thanks for the help.
In order to parse such HTML pages, you have to have knowledge about their structure. There's no general solution to this problem; each webpage needs its own solution. However, a good approach is to ensure the HTML code is valid XML too, and then use XPath to access elements at known positions. Maybe there's even an XPath-like solution for standard HTML (which is not always valid XML). This way you can define a set of XPaths for each page which give you the specific elements if they exist.
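That per-page set of XPaths can be kept as simple configuration data. A minimal sketch using Python's standard library, where the site name, field names, and rule paths are all hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site rules: each field maps to an XPath known to work there.
SITE_RULES = {
    "example-restaurant": {
        "opening_hours": ".//span[@class='hours']",
        "address": ".//div[@class='address']",
    },
}

def extract(site, page_xml):
    """Apply one site's XPath rules to one (XML-valid) page."""
    root = ET.fromstring(page_xml)
    out = {}
    for field, xpath in SITE_RULES[site].items():
        node = root.find(xpath)
        out[field] = node.text if node is not None else None
    return out

page = ("<html><body><span class='hours'>9am-5pm</span>"
        "<div class='address'>1 Main St</div></body></html>")
info = extract("example-restaurant", page)
```

Fields whose XPath matches nothing simply come back as None, so missing data is explicit rather than an error.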

Represent the search result by adding relevant description

I'm developing a simple search engine. If I search for something using it, it produces a list of URLs related to the search query.
I want to present the search results with a small, relevant description under each resulting URL (e.g., if you search for something on Google, you can see they provide a small description with each resulting link).
Any ideas?
Thanks in advance!
You need to store the position of each word in a webpage while indexing.
Your index should contain: word id, document id of the document containing the word, number of occurrences of the word in that document, and all the positions where the word occurred.
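That record layout can be sketched as a small positional index in Python: word, then document, then the list of positions; the occurrence count is just the length of the position list:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

# Toy corpus for illustration.
docs = {"doc1": "my cute cat", "doc2": "cat videos"}
index = build_positional_index(docs)
```

With the positions stored, the result page can later pull the words surrounding each hit to build a snippet.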
For more info you can read the research paper by Google's founders:
The Anatomy of a Large-Scale Hypertextual Web Search Engine
You can fetch the content of each page's meta description tag and display it as the small description. Google also does this.
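Fetching the meta description can be done with the standard-library HTML parser. A minimal sketch (real pages may vary the attribute order or case, which this handles via the parser's normalization):

```python
from html.parser import HTMLParser

class MetaDescriptionParser(HTMLParser):
    """Grabs the content of <meta name="description" content="..."> if present."""
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "description":
            self.description = a.get("content")

def meta_description(html_text):
    parser = MetaDescriptionParser()
    parser.feed(html_text)
    return parser.description
```

If a page has no meta description, the function returns None, and you would fall back to e.g. the first words of the body text.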
