โœŠ ํ•„์˜ค์˜ ๊ฐœ๋ฐœ์ผ์ง€
Back to Posts
2018๋…„ 9์›” 24์ผ

CodeSpitz78 3/ (stack ๊ตฌ์กฐ) HTML parser โœจโœจ

CodeSpitz78 3/ (stack ๊ตฌ์กฐ) HTML parser โœจโœจ

๐ŸŒ•๐ŸŒ‘๐ŸŒ‘

๐Ÿ”ฅ ์ฝ”๋“œ์Šคํ”ผ์ธ  ์ˆ˜์—…์„ ์ˆ˜๊ฐ•ํ•˜๋ฉด์„œ ๋ณต์Šตํ•œ ๋‚ด์šฉ์„ ์ •๋ฆฌํ–ˆ์Šต๋‹ˆ๋‹ค. ์ฐธ๊ณ  : ๋ Œ๋”๋ง ์—”์ง„ - ํŒŒ์‹ฑย 


1. ๊ฐœ์š” ์–ด๋–ค ์ƒํ™ฉ์„ ๋ณด๊ณ  ๊ตฌ์กฐ์ ์ด๊ณ  ์žฌ๊ท€์ ์ธ ํ˜•ํƒœ๋กœ ํŒŒ์•…์„ ํ•  ์ˆ˜ ์žˆ๋А๋ƒ, ๋ฐ์ดํ„ฐ ๋ถ„์„์„ ํ•  ์ˆ˜

์žˆ๋А๋ƒ..

BNF

<๊ธฐํ˜ธ> ::= <ํ‘œํ˜„์‹>

html ๊ธฐ๋ณธ ํŒจํ„ด

A = <tag>body</tag> B = <tag /> C = text body = (A|B|C)N

์œ„์˜ ๊ทธ๋ฆผ์„ ํŒŒ์‹ฑํ•  ์ˆ˜ ์žˆ๋Š” ํŒŒ์„œ๋ฅผ ๋งŒ๋“ค๊ณ  ์‹ถ๋‹ค! ์ผ€์ด์Šค๊ฐ€ ์žฌ๊ท€๋ฉด์„œ ๋ณตํ•ฉ์ ์ธ ์ƒํ™ฉ์„ ์งค ์ˆ˜ ์žˆ๋‹ค๋ฉด ์ค‘๊ธ‰๊ฐœ๋ฐœ์ž

ํŒŒ์„œ์˜ ๊ธฐ๋ณธ ๊ตฌ์กฐ

๋ฌธ์ž์—ด์„ ์ฝ์–ด์„œ ๊ตฌ์กฐ์ ์œผ๋กœ ๊ฐ์ฒดํ™” ์‹œ์ผœ ๋ฆฌํ„ดํ•˜๊ฒŒ ํ•˜๊ณ  ์‹ถ๋‹ค.

interface Result { name?: string; type: string; // text, node children: Result[]; } interface StackItem { tag: Result; } const parser = (input: string): Result[] => { input = input.trim(); // ์ดˆ๊ธฐํ™” const result = { name: 'ROOT', type: 'node', children: [] }; const stack = [{ tag: result }]; let curr, i = 0, j = input.length; while ((curr = stack.pop())) { while (i < j) { const cursor = i; // ํ—ท๊ฐˆ๋ฆฌ๊ธฐ๋•Œ๋ฌธ์— ์กฐํšŒ์šฉ์œผ๋กœ ๋”ฐ๋กœ if (input[cursor] === '<') { // '<'๋กœ ์‹œ์ž‘ํ•˜๋ฉด ํƒœ๊ทธ: A,B์˜ ๊ฒฝ์šฐ } else { // ํ…์ŠคํŠธ ํƒ€์ž…: C์˜ ๊ฒฝ์šฐ } } } return result; };

์ค‘์ฒฉ๋ฃจํ”„


2. ํ…์ŠคํŠธ ๋…ธ๋“œ Cํƒ€์ž…: ํ…์ŠคํŠธ

2-1. ์ˆœ์„œ

๋‹ค์Œ ํƒœ๊ทธ๊นŒ์ง€๋ฅผ ํŒŒ์•…

interface Result { name?: string, type: string, // text, node children: Result[] } interface StackItem { tag: Result } const parser = (input: string): Result[] => { input = input.trim(); const result = {name: 'ROOT', type: 'node', children: []}; const stack = [{tag: result}]; let curr, i = 0, j = input.length; while(curr = stack.pop()){ while(i < j){ const cursor = i; if(input[cursor] === '<'){ // '<'๋กœ ์‹œ์ž‘ํ•˜๋ฉด ํƒœ๊ทธ: A,B์˜ ๊ฒฝ์šฐ } else { // ํ…์ŠคํŠธ ํƒ€์ž…: C์˜ ๊ฒฝ์šฐ ๐Ÿ‘‡๐Ÿ‘‡๐Ÿ‘‡ const idx = input.indexOf('<', cursor); // cursor์œ„์น˜์—์„œ๋ถ€ํ„ฐ < ์œ„์น˜: ๋‹ค์Œ ํƒœ๊ทธ ์ „๊นŒ์ง€ curr.tag.children.push({ type: 'text', text: input.substring(cursor, idx); }); // children์ด ์—†๊ธฐ ๋•Œ๋ฌธ์— children์€ ํŒจ์Šค i = idx; // ๋‹ค์Œ ์‹œ์ž‘์ง€์ ์„ ์˜ฎ๊น€ ๐Ÿ‘†๐Ÿ‘†๐Ÿ‘† } } }; return result; }

2-2. ๋ถ„๋ฆฌ

์—ญํ• ์„ ์ธ์‹ํ•˜์ž๋งˆ์ž!! ๋ถ„๋ฆฌํ•˜์ž. ๋‚˜์ค‘์— ๋ถ„๋ฆฌํ•  ๋•Œ๋Š” ์ด๋ฏธ ์˜ค์—ผ๋˜์–ด์žˆ์–ด์„œ ๋ถ„๋ฆฌ์‹œํ‚ค๊ธฐ ์–ด๋ ต๋‹ค.

const textNode = (input: string, cursor: number, curr: StackItem): number => { const idx = input.indexOf('<', cursor); curr.tag.children.push({ type: 'text', text: input.substring(cursor, idx); }); return idx; }
interface Result { name?: string, type: string, // text, node children: Result[] } interface StackItem { tag: Result } const parser = input => { input = input.trim(); const result = {name: 'ROOT', type: 'node', children: []}; const stack = [{tag: result}]; let curr, i = 0, j = input.length; while(curr = stack.pop()){ while(i < j){ const cursor = i; // ํ—ท๊ฐˆ๋ฆฌ๊ธฐ๋•Œ๋ฌธ์— ์กฐํšŒ์šฉ์œผ๋กœ ๋”ฐ๋กœ if(input[cursor] === '<'){ // '<'๋กœ ์‹œ์ž‘ํ•˜๋ฉด ํƒœ๊ทธ: A,B์˜ ๊ฒฝ์šฐ } else { ๐Ÿ‘‡๐Ÿ‘‡๐Ÿ‘‡ i = textNode(input, cursor, curr) ๐Ÿ‘†๐Ÿ‘†๐Ÿ‘† } } }; return result; } // ๋ชฉํ‘œ: textNode๋ฅผ ์ƒ์„ฑ + result ๋ฐฐ์—ด์— ์ถ”๊ฐ€ => idx๊ฐ’(ํ…์ŠคํŠธ ๋…ธ๋“œ์˜ ๋งˆ์ง€๋ง‰ ์ˆœ์„œ๊ฐ’) ๋ฐ˜ํ™˜ const textNode = (input: string, cursor: number, curr: StackItem): number => { const idx = input.indexOf('<', cursor); curr.tag.children.push({ type: 'text', text: input.substring(cursor, idx); }); return idx; }

์ฝ”๋“œ๋ฅผ ์งค ๋•Œ๋Š” ๋ฌด์กฐ๊ฑด ์‰ฌ์šด ๊ฒƒ ๋ถ€ํ„ฐ ์ฒ˜๋ฆฌํ•œ๋‹ค.


3. ํƒœ๊ทธ ๋…ธ๋“œ

<๊ฐ€ ๋ฐœ๋™ ํŠธ๋ฆฌ๊ฑฐ๋กœ ์“ฐ์ด๊ณ  ์žˆ๋‹ค.

ํŠธ๋ฆฌ๊ฑฐ๊ฐ€ ๋ฐœ๋™๋˜๋Š” ์ผ€์ด์Šค๊ฐ€ 3๊ฐ€์ง€ ์ข…๋ฅ˜๊ฐ€ ์žˆ๋‹ค.

  1. ์‹œ์ž‘ํƒœ๊ทธ
  2. ๋‹ซ๋Š”ํƒœ๊ทธ
  3. ์™„๋ฃŒํƒœ๊ทธ

๊ณตํ†ต์ ์„ ์ฐพ์•„์„œ ์ฝ”๋“œ๋ฅผ ์ค‘๋ณต์‹œํ‚ค๋Š” ๊ฒƒ์„ ํ”ผํ•ด์•ผํ•œ๋‹ค. ๋ˆˆ์„ ํ›ˆ๋ จํ•ด์„œ ๋จผ์ € ๊ณตํ†ต์š”์†Œ๋ฅผ ์ถ”์ƒํ™” ํ•  ์ˆ˜ ์žˆ๋Š” ๋Šฅ๋ ฅ์„ ํ‚ค์›Œ์•ผํ•œ๋‹ค.

3.1 empty element <img />, open tag <div>

empty element๊ฐ€ ๋” ๊ฐ„๋‹จํ•ด ๋ณด์ด๋ฏ€๋กœ ๋จผ์ €์งœ๊ธ”

const parser = input => { input = input.trim(); const result = {name: 'ROOT', type: 'node', children: []}; // ๋ฆฌํ„ด๊ฐ’ : ์ด๋ฆ„์ด ๋ญ๊ณ , ํƒ€์ž…์ด ๋ญ๊ณ , children์„ ๋ญ˜ ๊ฐ–๊ณ  ์ž‡๋Š”์ง€. const stack = [{tag: result}]; // DOM ๊ฐ์ฒด let curr, i = 0, j = input.length; while(curr = stack.pop()){ while(i < j){ const cursor = i; // ํ—ท๊ฐˆ๋ฆฌ๊ธฐ๋•Œ๋ฌธ์— ์กฐํšŒ์šฉ์œผ๋กœ ๋”ฐ๋กœ if(input[cursor] === '<'){ // '<'๋กœ ์‹œ์ž‘ํ•˜๋ฉด ํƒœ๊ทธ: A,B์˜ ๊ฒฝ์šฐ // ํ˜„์žฌ cursor |< const idx = input.indexOf('>', cursor); // ํ˜„์žฌ idx |> i = idx + 1; if(input[cursor + 1] === '/'){ // close ํƒœ๊ทธ } else { ๐Ÿ‘‡๐Ÿ‘‡๐Ÿ‘‡ if(input[idx - 1] === '/'){ // empty ํƒœ๊ทธ } else { // open ํƒœ๊ทธ } } ๐Ÿ‘†๐Ÿ‘†๐Ÿ‘† } else i = textNode(input, cursor, curr) } }; return result; } const textNode = (input, cursor, curr) => {...};

cf__1 ์ฝ”๋“œ ์„ค๊ณ„๋ฅผ ์ž˜ํ•˜์ž ์ข‹์€ ์ฝ”๋“œ๋ฅผ ์งœ๋Š” ๋น„๋ฐ€์€ - ํ…Œ์ŠคํŠธ ์ฃผ๋„ ๊ฐœ๋ฐœ์— ์žˆ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, -

๋ฐ์ดํ„ฐ๋ฅผ ์ดํ•ดํ•˜๊ณ  ์žฌ๊ท€์ ์ธ ๋กœ์ง์„ ์ฐพ์•„๋‚ด๊ฑฐ๋‚˜ - ์ถ”์ƒํ™”๋œ ๊ณตํ†ต์ ์„ ์ฐพ์•„๋‚ด๊ฑฐ๋‚˜ - ์—ญํ• ์„ ์ดํ•ดํ•˜๊ฑฐ๋‚˜์— ์žˆ๋‹ค.

๋จธ๋ฆฟ์†์— ๋งจํ†จ๋ชจ๋ธ์ด ๊ทธ๋ ค์ง€๋ฉด ์ฝ”๋“œ๋กœ ๋˜‘๊ฐ™์ด ํ‘œํ˜„๋˜์–ด์•ผ์ง€ ์ •์ƒ์ด๋‹ค.

๋ฐ”๋ฅธ ๋ฐ์ดํ„ฐ ๋ชจ๋ธ๋ง์ด ๋ผ์—ˆ์œผ๋ฉด

let name, isClose; // ๐Ÿ‘ˆ ๊ณตํ†ต ์ค€๋น„์‚ฌํ•ญ ์ธต if (input[idx - 1] === '/') { // ๐Ÿ‘ˆ ๊ณตํ†ต ์ฒ˜๋ฆฌ์‚ฌํ•ญ ์ธต ~ // empty element ํƒœ๊ทธ // name์€ '<'์™€ '>'์‚ฌ์ด name = input.substring(cursor + 1, idx - 1); isClose = true; } else { // open ํƒœ๊ทธ name = input.substring(cursor + 1, idx); isClose = false; } const tag = { name, type: 'node', children: [] }; // ๐Ÿ‘ˆ ํก์ˆ˜ํ•˜๋Š”? ์ธต curr.tag.children.push(tag);

์ผ€์ด์Šค๋Š” ๋‹ค ๊ฐ’์œผ๋กœ ๋ฐ”๊ฟ€ ์ˆ˜ ์žˆ๋‹ค.

์œ„์˜ ์ฝ”๋“œ ํ˜•ํƒœ

  1. ๊ณตํ†ต ์ค€๋น„์‚ฌํ•ญ ์ธต
  2. ๊ณตํ†ต ์ฒ˜๋ฆฌ์‚ฌํ•ญ ์ธต
  3. ๋‹ค๋ฅธ์ ์„ ๊ธฐ์ˆ ํ•˜๋Š” ๋ถ€๋ถ„์„ ํก์ˆ˜ํ•˜๋Š” ์ธต

cf__2 ํ™”์ดํŠธ๋ฆฌ์Šค ํ™”์ดํŠธ๋ฆฌ์Šค whitelist โ€˜์•ˆ์ „โ€™์ด ์ฆ๋ช…๋œ ๊ฒƒ๋งŒ์„ ํ—ˆ์šฉํ•˜๋Š” ๊ฒƒ์œผ๋กœ โ€˜์•…์˜์„ฑโ€™์ด

์ž…์ฆ๋œ ๊ฒƒ์„ ์ฐจ๋‹จํ•˜๋Š” ๋ธ”๋ž™๋ฆฌ์ŠคํŠธ ๋ณด์•ˆ๊ณผ ์ƒ๋ฐ˜๋˜๋Š” ๋ณด์•ˆ ๋ฐฉ์‹ ์ด๋‹ค. ํ™”์ดํŠธ๋ฆฌ์ŠคํŠธ, ๋ธ”๋ž™๋ฆฌ์ŠคํŠธ๋ผ๋Š” ์šฉ์–ด ๋Œ€์‹  โ€˜positiveโ€™์™€ โ€˜nagativeโ€™ ๋ณด์•ˆ ๋ฐฉ๋ฒ•์œผ๋กœ ๋ถˆ๋ ค์ง€๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹ค.

const idx = input.indexOf('>', cursor); // ํ˜„์žฌ idx |> i = idx + 1; if(input[cursor + 1] === '/'){ // close ํƒœ๊ทธ } else { let name, isClose; if(input[idx - 1] === '/'){ // empty element ํƒœ๊ทธ // name์€ '<'์™€ '>'์‚ฌ์ด name = input.substring(cursor + 1, idx - 1); isClose = true; } else { // open ํƒœ๊ทธ name = input.substring(cursor + 1, idx); isClose = false; } const tag = {name, type: 'node', children: []}; curr.tag.children.push(tag); ๐Ÿ‘‡๐Ÿ‘‡๐Ÿ‘‡ if(!isClose){ stack.push({tag, back:curr}); // โœจโœจ ๋ฆฌํ„ดํฌ์ธํŠธ๋ฅผ ์ˆ˜๋™์œผ๋กœ ์ •์˜ํ•˜๋Š” ์ƒํ™ฉ. break; // while๋ฌธ์˜ break } ๐Ÿ‘†๐Ÿ‘†๐Ÿ‘† }

๋ถ„๋ฆฌ ~

const idx = input.indexOf('>', cursor); // ํ˜„์žฌ idx |> i = idx + 1; if(input[cursor + 1] === '/'){ // close ํƒœ๊ทธ } else { if(elementNode(input, cursor, idx, curr, stack)) break; } ... const elementNode = ( input: string, cursor: number, idx: number, curr: StackItem, stack: StackItem[]): boolean => { let name, isClose; if(input[idx - 1] === '/'){ // empty element ํƒœ๊ทธ // name์€ '<'์™€ '>'์‚ฌ์ด name = input.substring(cursor + 1, idx - 1); isClose = true; } else { // open ํƒœ๊ทธ name = input.substring(cursor + 1, idx); isClose = false; } const tag = {name, type: 'node', children: []}; curr.tag.children.push(tag); if(!isClose){ stack.push({tag, back:curr}); // ๋ฆฌํ„ดํฌ์ธํŠธ๋ฅผ ์ˆ˜๋™์œผ๋กœ ์ •์˜ํ•˜๋Š” ์ƒํ™ฉ. return true; } return false; }

cf__3 ๊ฐ€๋…์„ฑ์ด ๋†’์€ ์ฝ”๋“œ? (๋ณ€์ˆ˜๋ช…์„ ์ด์˜๊ฒŒ ์“ฐ๋˜, ์ปจ๋ฒค์…˜์„ ์ง€ํ‚ค๋˜) ์ฝ”๋“œ๊ฐ€ ๋ฆฌ๋”๋ธ”ํ•˜๋‹ค?

Readable - ์ ์ ˆํ•œ ์—ญํ• ๋ชจ๋ธ๋กœ ์œ„์ž„๋˜์„œ ๊ทธ๋“ค๊ฐ„์˜ ํ†ต์‹ ๊ณผ ํ˜‘์—…๋งŒ ๋ณผ ์ˆ˜ ์žˆ๋Š” ์ฝ”๋“œ๊ฐ€ ๊ฐ€๋…์„ฑ์ด ๋†’์€ ์ฝ”๋“œ์ด๋‹ค.

3.2 close tag

stack.push({ tag, back: curr });
... const idx = input.indexOf('>', cursor); // ํ˜„์žฌ idx |> i = idx + 1; if(input[cursor + 1] === '/'){ ๐Ÿ‘‡๐Ÿ‘‡๐Ÿ‘‡ // close ํƒœ๊ทธ curr = curr.back; ๐Ÿ‘†๐Ÿ‘†๐Ÿ‘† } else { if(elementNode(input, cursor, idx, curr, stack)) break; } ... const elementNode = (input, cursor, idx, curr, stack) => { let name, isClose; if(input[idx - 1] === '/'){ // empty element ํƒœ๊ทธ // name์€ '<'์™€ '>'์‚ฌ์ด name = input.substring(cursor + 1, idx - 1); isClose = true; } else { // open ํƒœ๊ทธ name = input.substring(cursor + 1, idx); isClose = false; } const tag = {name, type: 'node', children: []}; curr.tag.children.push(tag); if(!isClose){ stack.push({tag, back:curr}); // ๋ฆฌํ„ดํฌ์ธํŠธ๋ฅผ ์ˆ˜๋™์œผ๋กœ ์ •์˜ํ•˜๋Š” ์ƒํ™ฉ. return true; } return false; }

cf__4 css์••์ถ•์ด๋‚˜ javascript์••์ถ•๋ณด๋‹ค html์••์ถ•์ด ๋ธŒ๋ผ์šฐ์ €์˜ ๋ถ€ํ•˜๋ฅผ ์ค„์ด๋Š” ๋ฐฉ๋ฒ• ์“ธ๋ฐ์—†๋Š”

๋…ธ๋“œ์ƒ์„ฑ์„ ์ค„์ธ๋‹ค.

const elementNode = (input, cursor, idx, curr, stack) => { const isClose = input[idx - 1] === '/'; const tag = { name: input.substring(cursor + 1, idx - (isClose ? 1 : 0)), type: 'node', children: [], }; curr.tag.children.push(tag); if (!isClose) { stack.push({ tag, back: curr }); // โœจโœจ ๋ฆฌํ„ดํฌ์ธํŠธ๋ฅผ ์ˆ˜๋™์œผ๋กœ ์ •์˜ํ•˜๋Š” ์ƒํ™ฉ. return true; } return false; };

4. ์ตœ์ข…

const parser = input => { input = input.trim(); const result = { name: 'ROOT', type: 'node', children: [] }; const stack = [{ tag: result }]; // DOM ๊ฐ์ฒด let curr, i = 0, j = input.length; while ((curr = stack.pop())) { while (i < j) { const cursor = i; // ํ—ท๊ฐˆ๋ฆฌ๊ธฐ๋•Œ๋ฌธ์— ์กฐํšŒ์šฉ์œผ๋กœ ๋”ฐ๋กœ if (input[cursor] === '<') { const idx = input.indexOf('>', cursor); i = idx + 1; if (input[cursor + 1] === '/') { // ๋‹ซ๋Š”ํƒœ๊ทธ curr = curr.back; } else { if (elementNode(input, cursor, idx, curr, stack)) break; // ๋‘๋ฒˆ์งธ ๋ฃจํ”„ break } } else i = textNode(input, cursor, curr); } } return result; }; const textNode = (input, cursor, curr) => { const idx = input.indexOf('<', cursor); curr.tag.children.push({ type: 'text', text: input.substring(cursor, idx), }); return idx; }; const elementNode = (input, cursor, idx, curr, stack) => { const isClose = input[idx - 1] === '/'; const tag = { name: input.substring(cursor + 1, idx - (isClose ? 1 : 0)), type: 'node', children: [], }; curr.tag.children.push(tag); if (!isClose) { stack.push({ tag, back: curr }); // โœจโœจ ๋ฆฌํ„ดํฌ์ธํŠธ๋ฅผ ์ˆ˜๋™์œผ๋กœ ์ •์˜ํ•˜๋Š” ์ƒํ™ฉ. return true; } return false; };
PreviousCodeSpitz78 4/ OOAD์™€ ํ…ŒํŠธ๋ฆฌ์Šค (1)
NextCodeSpitz78 2/ ๋ฃจํ‹ด ์‹ฌํ™”

Related

ยฉ 2025 Felix