
HTML structure analysis

HTML (Hypertext Markup Language) is the markup language used to create the structure of web pages. HTML uses tags to describe the individual elements in a web page, such as headings, paragraphs, and links. Web browsers read HTML files and render them.

If you want to learn to write crawlers, you must be familiar with HTML code. Without it, we have no way to analyze the structure of a page, and no way to do any deeper analysis.

1. How HTML works

  • Principle
    • html is the abbreviation of HyperText Markup Language. It is an interpreted language: it does not need to be compiled and is interpreted and executed directly by the browser
  • Composition of a web page
    • html is responsible for displaying the data
    • css is responsible for beautifying the page
    • js is responsible for the page's dynamic behavior

    2. Getting to know tags

    Using the marquee tag as an introduction. When learning a tag, you should:
    • Remember what it does
    • Know how to write it
      • Tags are divided into single tags and paired tags
    <!-- Single tag -->
    <tagname attribute1='value' attribute2='value' attribute3=value>

    <!-- Paired tag -->
    <tagname attribute1='value' attribute2='value' attribute3=value>Content</tagname>
    Notes:
    • You cannot invent your own tags
    • Write tags with English (half-width) characters
    • Attribute values can be wrapped in single or double quotes; single quotes are recommended
    • Attributes must be written in the start tag
    • Tags can be nested, but one tag must be completely nested inside the other
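To see these rules from a crawler's point of view, here is a minimal sketch using Python's standard `html.parser` module. It records every start tag (with its attributes) and every end tag. The `TagLogger` class name and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record start tags (with attributes) and end tags in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the start tag
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, {}))


# Hypothetical sample: p and a are paired tags, br is a single tag
sample = "<p class='intro'>Hello<br>world <a href='/next' target=_blank>next</a></p>"
logger = TagLogger()
logger.feed(sample)
logger.close()

for kind, tag, attrs in logger.events:
    print(kind, tag, attrs)
```

Note that the single tag `br` produces a start event only, while the paired tags `p` and `a` each produce a start and an end event.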

    3. Global structure tags (key)

    <!doctype html> document type; declares this as an html5 document
    <html> root tag
    <head> head
    <meta charset='UTF-8'> tells the browser to interpret the document using the utf-8 encoding
    <title>Document</title> document title
    </head>
    <body>

    </body>
    </html>

    3.1 title tag

    <title>Document</title>

    Sets the title of the document, which is displayed in the title bar of the window

    3.2 Set character set

    <meta charset='UTF-8'>

    The character set tells the browser which encoding to use when interpreting the html document. Note that the html file itself is also saved in some encoding; the two must be the same, otherwise the text will be garbled
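The effect of a mismatch can be reproduced in a few lines of Python. This is only a sketch: the sample text and the choice of latin-1 as the "wrong" encoding are assumptions made for the demonstration.

```python
text = "café"

# Bytes as they would be stored in a file saved as utf-8
raw = text.encode("utf-8")

# Decoding with the matching charset restores the original text
correct = raw.decode("utf-8")

# Decoding with a different charset does not fail here, but the result is garbled
garbled = raw.decode("latin-1")

print(correct)
print(garbled)
```

The same thing happens in a crawler when the downloaded bytes are decoded with a charset other than the one the page was saved in.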

    3.3 body

    (For reference) The content display area. Some common attributes:

    • topmargin: top margin
    • leftmargin: left margin
    • text: text color
    • bgcolor: background color
    • background: background image; it conflicts with bgcolor, and when a background image is set the background color is not displayed

    3.4 Global attributes

    Every tag has attributes; commonly used ones are id, class, name and style

    4. Common html tags

    How an html file is displayed: runs of spaces, newlines and tabs are replaced by a single space. English characters with no space between them are treated as one word, which is not broken across lines automatically
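When comparing text extracted from a page with what the browser shows, it helps to apply the same whitespace rule. The sketch below approximates it with a regular expression. The function name `collapse_whitespace` is hypothetical, and it does not cover content inside tags such as pre, where whitespace is kept.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace each run of spaces, tabs and newlines with a single space."""
    return re.sub(r"[ \t\r\n]+", " ", text)


source = "Hello     world,\n\tthis   is\n\nHTML"
print(collapse_whitespace(source))
```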

    4.1 Common tags

    • h1~h6: headings; generally a page has only one h1
    • hr (single tag): horizontal dividing line
      • width: either an absolute value without a unit, such as 300, or a percentage, such as 50%
      • align: alignment, one of left, center (default), right
    • p: paragraph, with spacing before and after it
    • br (single tag): line break
    • nobr (paired tag): no line break; however long the content is, it never wraps automatically, and a scroll bar appears if it does not fit
    • pre: keeps the original formatting; spaces and line breaks are displayed as written
    • b (strong): bold
    • i (cite, em): italic
    • u: underline
    • sub/sup: subscript/superscript (sub sits below the baseline, sup above it)
    • font (face/color/size): font
      • face: font name; see the Fonts subdirectory of the Windows directory for the available names
      • color: font color
      • size: a value from 1 to 7, with 7 the largest
    • blockquote: a quotation, set apart from normal text with left and right margins

    4.2 Comments and entity references

    • Comments
    <!-- I am a comment -->
    <!--
    I am a comment

    -->
    The role of comments:
    1. Improve the readability of the code, mainly for other team members, which makes maintenance easier
    2. Make debugging easier
    • Entity references
    &nbsp; space
    &lt; <
    &gt; >
    &amp; &
    &quot; "
    &apos; '
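Text scraped from a page often still contains entity references. A small sketch with Python's standard `html` module shows the conversion in both directions. The sample strings are made up for illustration.

```python
import html

escaped = "&lt;p&gt;Tom &amp; Jerry&nbsp;&quot;show&quot;&lt;/p&gt;"

# Turn entity references back into the characters they stand for
plain = html.unescape(escaped)
print(plain)

# The other direction: make text safe to embed in an html page
safe = html.escape("<b>1 & 2</b>")
print(safe)
```

Note that `&nbsp;` becomes a non-breaking space (`\xa0`), not an ordinary space, which matters when comparing or splitting scraped strings.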

    5. List

    5.1 ordered list (ol/li)

    • type: numbering style, one of 1 (numbers), a, A, I, i
    • start: the starting number; defaults to 1

    5.2 Unordered list (ul/li)

    • type: bullet style
      • disc: solid circle (default)
      • square: solid square
      • circle: hollow circle

    5.3 Definition list (dl/dt/dd)

    6. Hyperlink (key)

    • How to write hyperlinks
    <a href='http://www.baidu.com/'>Baidu</a>
    • url: Uniform Resource Locator
    https://baike.baidu.com:80/item/%E6%9D%A8%E5%B9%82/149851?fr=aladdin#4
    Part 1: the protocol: http, https, ftp, news, magnet (magnet link)
    Part 2: the host; the server address can be a domain name or an ip address
    Part 3: the number after the colon is the port: http 80 (default), https 443, ftp 21, mysql 3306
    Port numbers range from 0 to 65535, of which 0 to 1023 are reserved for system services
    Part 4: the path of the requested file, from the slash after the port up to the ?
    Part 5: the request parameters (query string), between the ? and the #; written as key=value&key2=value
    Part 6: the anchor, used to jump within the same page; it must start with #
    • href: the requested url; note that an external url must include the protocol

    • title: tip shown when the mouse hovers over the hyperlink

    • target

      • _blank: opens in a new window
      • _self: opens in the current window (default)
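The six parts of a url described above can be pulled apart with Python's standard `urllib.parse` module. This sketch reuses the example url from the text; nothing is requested over the network.

```python
from urllib.parse import urlsplit, parse_qs, unquote

url = "https://baike.baidu.com:80/item/%E6%9D%A8%E5%B9%82/149851?fr=aladdin#4"
parts = urlsplit(url)

print(parts.scheme)           # part 1: protocol
print(parts.hostname)         # part 2: host
print(parts.port)             # part 3: port
print(unquote(parts.path))    # part 4: path, with %XX escapes decoded
print(parse_qs(parts.query))  # part 5: query string as a dict of lists
print(parts.fragment)         # part 6: anchor, without the leading #
```

`parse_qs` returns a list for every key because the same key may appear more than once in a query string.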

    7. img tag (single tag)

    img is a single tag: <img src='' title='' alt='' border='' width='' height=''>
    • Absolute path and relative path
    Local absolute path: file:///C:/python/web/1/ym.jpg
    Network absolute path: https://gss0.bdstatic.com/94o3dSag_xI4khGkpoWK1HF6hhy/baike/c0%3Dbaike80%2C5%2C5%2C80%2C26/sign=32ceb0ef04d79123f4ed9c26cc5d32e7/7c1ed21b0ef41bd55520081359da81cb38db3de2.jpg
    Site absolute path (for reference): / stands for the root directory of the website

    Relative path: relative to the directory containing the html document; ../ is the parent directory and ./ is the current directory, for example ./3/index.html and ../1.html
    • src: image source; it can be a relative path or an absolute path

    • title: tooltip text for the image

    • alt: text shown when the image cannot be displayed

    • border: image border size; the default is generally 0

    • width/height: in general only one of the two is set, and the other is scaled proportionally
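A crawler usually has to turn the relative paths it finds in src or href into full urls before it can download anything. A minimal sketch with `urllib.parse.urljoin` follows; the page address and the paths are hypothetical examples.

```python
from urllib.parse import urljoin

# Hypothetical address of the html document that contains the img tags
page_url = "https://example.com/web/1/index.html"

sources = ["./3/index.html", "../1.html", "/img/ym.jpg", "https://example.org/a.png"]

# Resolve each path against the address of the page it was found on
resolved = {src: urljoin(page_url, src) for src in sources}

for src, full in resolved.items():
    print(src, "->", full)
```

Relative paths are resolved against the directory of the page, a path starting with / against the site root, and a full url is left unchanged.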

    8. Table

    • table
      • border: border line
      • cellspacing: spacing between cells
      • cellpadding: distance from the cell border to its content
      • align: horizontal alignment, one of left, center, right
      • height: does not need to be set; the content stretches it automatically
    • tr: row
      • align: horizontal alignment, one of left, center, right
      • valign: vertical alignment, one of top, middle, bottom
      • Note: if no height is set for the table, setting valign has no effect. When laying out pages later, valign is generally not used. The one exception is when a picture and text are arranged side by side, where the valign of the picture needs to be set to middle
    • td: cell
      • colspan: merge to the right across columns
      • rowspan: merge downward across rows
      • width/height
    • th
      • Header cell of the table; the content is bold, and otherwise it is used the same way as td
    • caption: table title; it moves together with the table
    <table border=1 width=100>
    <tr align='left'>
    <td>A</td>
    <td>A</td>
    <td>A</td>
    </tr>
    <tr align='left'>
    <td>B</td>
    <td>B</td>
    <td>B</td>
    </tr>
    </table>
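Tables are a common target when crawling, so here is a rough sketch of reading one into a list of rows with the standard `html.parser` module. The `TableParser` class is hypothetical, and it handles only simple tables: no nested tables and no colspan/rowspan.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every td/th cell, grouped by tr row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # a new row starts
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")      # a new, still empty cell

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


table_html = """
<table border=1 width=100>
<tr align='left'><td>A</td><td>A</td><td>A</td></tr>
<tr align='left'><td>B</td><td>B</td><td>B</td></tr>
</table>
"""
parser = TableParser()
parser.feed(table_html)
parser.close()
print(parser.rows)
```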

    9. Form (important)

    Purpose: collect user information and submit it to the server

    Basic use:

    • Not every tag can submit information. Only form items can submit information to the server: input, select and textarea

    • Form items must be placed inside the form tag to submit information

    • action: the submit address, usually a page on the server

    • method: the submission method; the two most important are get and post, and the default is get

    The difference between get and post:
    1. get is used to request data from the server; post is generally used to submit data to the server
    2. A get request transmits its data through the url, so the data is exposed in the browser address bar, which is not safe;
    the data of a post request is in the request body and does not appear in the browser address bar, which is relatively safe
    3. The parameters of a get request are limited by the length of the url, generally to a few kilobytes;
    the parameters of a post request are theoretically unlimited in size
    • enctype: used for file upload, with the value multipart/form-data; for now it is enough to be aware of it
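The difference between the two methods is easy to see when building the requests by hand. The sketch below uses Python's standard `urllib`; the address and field names are hypothetical, and nothing is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

fields = {"username": "tom", "q": "a b"}  # hypothetical form fields
query = urlencode(fields)                  # key=value&key2=value

# get: the data travels in the url itself
get_req = Request("https://example.com/search?" + query)

# post: the same data travels in the request body, as bytes
post_req = Request("https://example.com/search", data=query.encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

`Request` switches to the post method as soon as `data` is supplied; passing either object to `urllib.request.urlopen` would perform the request.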

    • input box

      • type: the kind of control: single-line text box (text), password box (password), check box (checkbox), radio button (radio), file upload (file), button (button), reset button (reset), submit button (submit)
      • placeholder: hint text for the user; it disappears automatically once the user starts typing
      • maxlength: maximum number of characters
      • name: the field name; it must be set for the value to be submitted
      • value: default value
      • readonly: read-only
      • disabled: unavailable
      • size: display width of the box
      • Common attributes: type, name, value, readonly, disabled, size

    • Single-line text box

    <input type='text' value='Reset' />
    • Submit button

      • type: submit

      • value: the caption of the submit button

    <input type='submit' value='Submit' />
    • Reset button

      • Clears the user's input

      • type: reset

    <input type='reset' value='Reset' />
    • Password box

      • type: password
    <input type='password' name='password' />
    • Radio button

      Generally used to choose one of several options. Buttons with the same name form a group, and only one can be selected in a group

      • type: radio

      • checked: whether it is selected

      • value: generally 0 or 1; it must be set, otherwise the server cannot tell which one was selected

    <input type='radio' name='ball1' checked value='basketball' /> basketball
    • Check box

      The name values within a group are generally the same

      • type: checkbox

      • value: must be set

      • checked: whether it is checked

    <input type='checkbox' name='ball1' value='basketball' checked /> basketball
    <input type='checkbox' name='ball2' value='football' checked /> football
    • File upload

      • type: file
      • The enctype attribute belongs on the enclosing form tag (enctype='multipart/form-data'), not on the input
    <input type='file' name='upload' />
    • Hidden field

      Generally used to submit data that does not require user input

      • type: hidden

      • name and value must be set

    <input type='hidden' name='a' value='123' />
    • button: generally used together with js code
    <button>button</button>
    • Drop-down box (select)

      • selected: whether the option is selected
      • value: should be set on each option, otherwise the submitted value is the text inside the option
      • name: must be set
      • size: the number of rows displayed; if this attribute is set, the drop-down box becomes a list box
      • multiple: whether more than one option can be selected
    <select name=''>
    <option>1</option>
    <option>2</option>
    <option>3</option>
    </select>
    • Multi-line text (textarea)

      • cols: number of columns
      • rows: number of rows
      • Note: do not leave anything between the textarea tags, not even whitespace, otherwise it is displayed as the initial content
    <textarea cols=10 rows=6></textarea>
    • label

      • Used with radio and checkbox to make it easier for users to select
    <input type='radio' name='sex' value='man' checked id='man'> <label for='man'> </label>
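To imitate a form submission, a crawler first needs to know which name/value pairs the browser would send. The sketch below collects them from input tags according to the rules above: no name means nothing is submitted, and unchecked radio buttons and check boxes are left out. The `FormFields` class and the sample form are hypothetical, and select, textarea and file inputs are deliberately not handled.

```python
from html.parser import HTMLParser


class FormFields(HTMLParser):
    """Collect (name, value) pairs from input tags roughly as a submit would."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        kind = attr.get("type", "text")
        if not name or "disabled" in attr:
            return  # no name, or disabled: nothing is submitted
        if kind in ("radio", "checkbox") and "checked" not in attr:
            return  # unchecked boxes are left out
        if kind in ("submit", "reset", "button", "file"):
            return  # skipped in this simplified sketch
        value = attr.get("value")
        if value is None:
            # browsers send "on" for a checked box that has no value
            value = "on" if kind in ("radio", "checkbox") else ""
        self.fields.append((name, value))


form_html = """
<form action='/register' method='post'>
  <input type='text' name='username' value='tom'>
  <input type='password' name='password' value='secret'>
  <input type='radio' name='sex' value='man' checked>
  <input type='radio' name='sex' value='woman'>
  <input type='checkbox' name='ball' value='basketball' checked>
  <input type='checkbox' name='ball' value='football'>
  <input type='hidden' name='a' value='123'>
  <input type='text' value='no name, so not submitted'>
  <input type='submit' value='Submit'>
</form>
"""
collector = FormFields()
collector.feed(form_html)
collector.close()
print(collector.fields)
```

The resulting list can be passed to `urllib.parse.urlencode` to build the query string or request body.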

    10. The use of developer tools

    Of course, CSS and JavaScript are also essential parts of a standard web page. How do we analyze a web page? Take Chrome as an example.
    We visit: https://movie.douban.com/

    Locating elements on any page follows the same steps.
    Let's take a look at the use of the Network panel.
    The Network panel is blank at first, so we need to refresh the page.

    Next, look at the request and the corresponding response content.

    With the information from these views, you can set the crawler's request headers and other details.
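As a sketch of that last step, the headers seen in the Network panel can be attached to a request with the standard `urllib.request.Request` class. The header values below are placeholders rather than values from the original article, and the request is only built, not sent.

```python
from urllib.request import Request

# Header values would be copied by hand from the Network panel;
# the ones below are placeholders for illustration only.
headers = {
    "User-Agent": "Mozilla/5.0 (example placeholder)",
    "Referer": "https://movie.douban.com/",
}

req = Request("https://movie.douban.com/", headers=headers)

# Nothing is sent here; urllib.request.urlopen(req) would perform the request.
sent = {name.lower(): value for name, value in req.header_items()}
for name, value in sent.items():
    print(name, ":", value)
```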

    Articles are uploaded by users and are for non-commercial browsing only. Posted by: Lomu, please indicate the source: https://www.daogebangong.com/en/articles/detail/HTML%20structure%20analysis.html
