<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>##뚝딱뚝딱 딥러닝##</title>
    <link>https://ga02-ailab.tistory.com/</link>
    <description></description>
    <language>ko</language>
    <pubDate>Tue, 14 Apr 2026 14:37:04 +0900</pubDate>
    <generator>TISTORY</generator>
    <ttl>100</ttl>
    <managingEditor>ga.0_0.ga</managingEditor>
    <image>
      <title>##뚝딱뚝딱 딥러닝##</title>
      <url>https://tistory1.daumcdn.net/tistory/5940760/attach/e3380a5273b343d788cc3dde2afd8152</url>
      <link>https://ga02-ailab.tistory.com</link>
    </image>
    <item>
      <title>[4] SwimBird: Eliciting Switchable Reasoning Mode in Hybrid Autoregressive MLLMs</title>
      <link>https://ga02-ailab.tistory.com/197</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;[paper] &lt;a href=&quot;https://arxiv.org/pdf/2602.06040&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/pdf/2602.06040&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;[Github] &lt;a href=&quot;https://github.com/Accio-Lab/SwimBird&quot; target=&quot;_blank&quot; rel=&quot;noopener noreferrer&quot;&gt;https://github.com/Accio-Lab/SwimBird&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1773321740507&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;GitHub - Accio-Lab/SwimBird&quot; data-og-description=&quot;Contribute to Accio-Lab/SwimBird development by creating an account on GitHub.&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/Accio-Lab/SwimBird&quot; data-og-url=&quot;https://github.com/Accio-Lab/SwimBird&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/ozRmG/dJMb88eY5XQ/1d6LgrwYLsctqR4itX7Zh0/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600,https://scrap.kakaocdn.net/dn/btDNuY/dJMb83krnm8/BLgyBlhYuCRYJJOBfQ49P1/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600,https://scrap.kakaocdn.net/dn/d8zCfp/dJMb89yb1eZ/dSayoKl57rQKjvqmSsWbM0/img.jpg?width=4088&amp;amp;height=2873&amp;amp;face=0_0_4088_2873&quot;&gt;&lt;a href=&quot;https://github.com/Accio-Lab/SwimBird&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/Accio-Lab/SwimBird&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/ozRmG/dJMb88eY5XQ/1d6LgrwYLsctqR4itX7Zh0/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600,https://scrap.kakaocdn.net/dn/btDNuY/dJMb83krnm8/BLgyBlhYuCRYJJOBfQ49P1/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600,https://scrap.kakaocdn.net/dn/d8zCfp/dJMb89yb1eZ/dSayoKl57rQKjvqmSsWbM0/img.jpg?width=4088&amp;amp;height=2873&amp;amp;face=0_0_4088_2873');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GitHub - Accio-Lab/SwimBird&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Contribute to Accio-Lab/SwimBird development by creating an account on GitHub.&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&amp;nbsp;&lt;/h4&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000; background-color: #c1bef9;&quot;&gt;1. Introduction&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/h4&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;833&quot; data-origin-height=&quot;683&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bXVlUK/dJMcaibuXiW/cIXfz8UQp6uTkJDHBbkVfK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bXVlUK/dJMcaibuXiW/cIXfz8UQp6uTkJDHBbkVfK/img.png&quot; data-alt=&quot;Modality Redundancy: image imagination is forced even into text problems that need no visualization. Modality Mismatch: vision-grounded reasoning is required, yet the model gives only text-centric explanations, or the visual information and the text are disconnected&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bXVlUK/dJMcaibuXiW/cIXfz8UQp6uTkJDHBbkVfK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbXVlUK%2FdJMcaibuXiW%2FcIXfz8UQp6uTkJDHBbkVfK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;833&quot; height=&quot;683&quot; data-origin-width=&quot;833&quot; data-origin-height=&quot;683&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;Modality Redundancy: image imagination is forced even into text problems that need no visualization. Modality Mismatch: vision-grounded reasoning is required, yet the model gives only text-centric explanations, or the visual information and the text are disconnected&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Points out problems in the reasoning patterns of existing MLLMs&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;They always use either text CoT only or latent visual tokens only&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;&amp;rArr; Different questions call for different modes of thinking, so why always use the same reasoning pattern?&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;They cannot respond flexibly to diverse query types&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;In text-centric logical problems this incurs unnecessary visual computation and degrades performance; conversely, in problems that mainly need visual information, it loses information that text alone cannot express&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;To address this, SwimBird...&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;lets the model itself choose one of three modes per query (text-only / vision-only / vision-text interleaved), and allocates the latent-token length dynamically according to problem difficulty&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;text-only reasoning: centered on text CoT, marked as &amp;lt;reason&amp;gt;&amp;hellip;&amp;lt;/reason&amp;gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;vision-only reasoning: generates &lt;b&gt;continuous latent tokens (embeddings)&lt;/b&gt; within a &amp;lt;|latent_start|&amp;gt; &amp;hellip; &amp;lt;|latent_end|&amp;gt; span, minimizing text CoT.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;vision-text interleaved reasoning: alternates between latent tokens and text reasoning as needed&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Uses hybrid autoregression to unify text-token prediction and visual-token embedding prediction in a single framework &amp;rarr; the choice among the three modes above is itself learned!&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;2. Method&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;829&quot; data-origin-height=&quot;377&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/XU2Gs/dJMcahDClEE/Bcy3vInpiZKNcVfGcJxPjK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/XU2Gs/dJMcahDClEE/Bcy3vInpiZKNcVfGcJxPjK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/XU2Gs/dJMcahDClEE/Bcy3vInpiZKNcVfGcJxPjK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FXU2Gs%2FdJMcahDClEE%2FBcy3vInpiZKNcVfGcJxPjK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;829&quot; height=&quot;377&quot; data-origin-width=&quot;829&quot; data-origin-height=&quot;377&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;&lt;b&gt;3.1 Hybrid Autoregressive Modeling&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;&lt;b&gt;Textual thought as next-token prediction&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Uses a shifted cross-entropy loss&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;x: image, w: word tokens&lt;/span&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;298&quot; data-origin-height=&quot;69&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/d4pWnD/dJMcaa5AwX0/QgbXe2ukmIGlvAHS0EaClK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/d4pWnD/dJMcaa5AwX0/QgbXe2ukmIGlvAHS0EaClK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/d4pWnD/dJMcaa5AwX0/QgbXe2ukmIGlvAHS0EaClK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fd4pWnD%2FdJMcaa5AwX0%2FQgbXe2ukmIGlvAHS0EaClK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;298&quot; height=&quot;69&quot; data-origin-width=&quot;298&quot; data-origin-height=&quot;69&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;&lt;b&gt;Visual thought as next-embedding prediction&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Uses an MSE loss&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;z: visual latent tokens&lt;/span&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;214&quot; data-origin-height=&quot;74&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cwLbxj/dJMcabQWF4n/5ymfkarY06s7EOwSH8wK5K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cwLbxj/dJMcabQWF4n/5ymfkarY06s7EOwSH8wK5K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cwLbxj/dJMcabQWF4n/5ymfkarY06s7EOwSH8wK5K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcwLbxj%2FdJMcabQWF4n%2F5ymfkarY06s7EOwSH8wK5K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;214&quot; height=&quot;74&quot; data-origin-width=&quot;214&quot; data-origin-height=&quot;74&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;&lt;b&gt;Unified training objective&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Final loss&lt;/span&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;247&quot; data-origin-height=&quot;43&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bFrA4a/dJMcahRamMY/f0iviCJJ2d9KY8DgnQagf1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bFrA4a/dJMcahRamMY/f0iviCJJ2d9KY8DgnQagf1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bFrA4a/dJMcahRamMY/f0iviCJJ2d9KY8DgnQagf1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbFrA4a%2FdJMcahRamMY%2Ff0iviCJJ2d9KY8DgnQagf1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;247&quot; height=&quot;43&quot; data-origin-width=&quot;247&quot; data-origin-height=&quot;43&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
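The unified objective above combines the shifted cross-entropy over text tokens with the MSE over latent embeddings. A minimal plain-Python sketch of how the two terms are mixed (the weight `lam` and the list-based tensors are illustrative assumptions, not the paper's implementation):

```python
import math

def text_ce(probs, targets):
    # shifted cross-entropy over text positions: mean of -log p(w_t | context)
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

def latent_mse(pred, target):
    # mean squared error between predicted and reference latent embeddings
    n = sum(len(v) for v in target)
    return sum((a - b) ** 2
               for vp, vt in zip(pred, target)
               for a, b in zip(vp, vt)) / n

def unified_loss(probs, targets, pred_z, target_z, lam=1.0):
    # L = L_CE (text tokens) + lam * L_MSE (latent embeddings)
    return text_ce(probs, targets) + lam * latent_mse(pred_z, target_z)
```

The single scalar keeps both output types under one autoregressive training signal, which is what lets mode selection itself be learned.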
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;&lt;b&gt;Mode switching with special delimiters&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Since the output space is extended to two kinds of outputs (tokens and embeddings), special tokens are needed to distinguish them&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;During training, visual-thought spans are marked with &amp;lt;|latent_start|&amp;gt; &amp;hellip; &amp;lt;|latent_end|&amp;gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Tells the model to generate continuous latent embeddings there instead of text tokens&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;system prompt&lt;/span&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2048&quot; data-origin-height=&quot;810&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/xx9Pw/dJMcacPRWty/3R90wuzVdibwzkeWM5WFaK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/xx9Pw/dJMcacPRWty/3R90wuzVdibwzkeWM5WFaK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/xx9Pw/dJMcacPRWty/3R90wuzVdibwzkeWM5WFaK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fxx9Pw%2FdJMcacPRWty%2F3R90wuzVdibwzkeWM5WFaK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2048&quot; height=&quot;810&quot; data-origin-width=&quot;2048&quot; data-origin-height=&quot;810&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;What about at inference time?&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;For example, once the model emits &amp;lt;|latent_start|&amp;gt;, it switches to embedding generation &amp;rarr; when it emits &amp;lt;|latent_end|&amp;gt;, it returns to text-token generation&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
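The switching behavior at inference can be sketched as a toy decoding loop. The `step_text`/`step_embedding` callables are hypothetical stand-ins for the model's two output heads, and the delimiter strings are built with `\x3c` escapes only so they survive this HTML-escaped post:

```python
LATENT_START = "\x3c|latent_start|\x3e"  # latent-start delimiter
LATENT_END = "\x3c|latent_end|\x3e"      # latent-end delimiter
EOS = "\x3ceos\x3e"                      # end-of-sequence token

def generate(step_text, step_embedding, max_steps=32):
    """Toy decoding loop: emit text tokens until LATENT_START appears,
    then emit latent embeddings until LATENT_END, then resume text."""
    out, mode = [], "text"
    for _ in range(max_steps):
        if mode == "text":
            tok = step_text()           # next text token from the model
            out.append(tok)
            if tok == LATENT_START:
                mode = "latent"         # switch to embedding generation
            elif tok == EOS:
                break
        else:
            emb = step_embedding()      # next latent embedding (or end marker)
            out.append(emb)
            if emb == LATENT_END:
                mode = "text"           # return to text-token generation
    return out
```

The point is that the mode switch is driven entirely by what the model emits, not by any external controller.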
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;&lt;b&gt;3.2 Dynamic Latent Token Budget&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;415&quot; data-origin-height=&quot;358&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/KHcDy/dJMcac3m8pS/GacAnY8jwh9HDt89wSacIk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/KHcDy/dJMcac3m8pS/GacAnY8jwh9HDt89wSacIk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/KHcDy/dJMcac3m8pS/GacAnY8jwh9HDt89wSacIk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FKHcDy%2FdJMcac3m8pS%2FGacAnY8jwh9HDt89wSacIk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;415&quot; height=&quot;358&quot; data-origin-width=&quot;415&quot; data-origin-height=&quot;358&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Adjusts the number of visual latent tokens according to the input image's resolution&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Problems with the conventional fixed token count&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Insufficient capacity for high-resolution images / wasted computation for low-resolution images&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Generating intermediate images at a fixed length during training risks information loss&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;SwimBird generates latent tokens dynamically, taking resolution into account&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Exploits the native-resolution-preserving property of the Qwen ViT encoder&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Assigns different maximum pixel sizes to the question image and to intermediate thought images&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Designed so that the number of tokens the visual encoder produces is proportional to the image's actual information content&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;The number of latent tokens K is determined variably within a predefined range [N_min, N_max], according to the image's resolution and the query's difficulty&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;The model keeps generating tokens until it decides to stop by emitting &amp;lt;|latent_end|&amp;gt; (it regulates the amount of visual thinking on its own)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Advantages of this approach?&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;&lt;b&gt;Precision preserved&lt;/b&gt;: for images that need fine-grained, high-resolution analysis, allocating more latent tokens and avoiding heavy pooling preserves important visual cues&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;&lt;b&gt;No wasted computation&lt;/b&gt;: for low-resolution or low-information-density images, suppressing unnecessary token generation speeds up inference and reduces memory usage&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
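The budget rule can be sketched roughly: the token count grows with image area and is clamped to the predefined range [N_min, N_max]. The patch and merge sizes below mirror Qwen-VL-style encoders, but they and the exact area-to-token mapping are illustrative assumptions, not the paper's numbers:

```python
def latent_budget(width, height, patch=28, merge=2, n_min=4, n_max=64):
    # One latent token per merged patch, so the count is roughly
    # proportional to image area, then clamped to [n_min, n_max].
    tokens = (width // (patch * merge)) * (height // (patch * merge))
    return max(n_min, min(n_max, tokens))
```

A high-resolution input thus gets the full budget while a thumbnail-sized one is capped near the floor, matching the precision/efficiency trade-off described above.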
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;&lt;b&gt;3.3 Switchable Reasoning SFT Dataset Construction&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;A data-curation pipeline is designed to enable learning of the switchable reasoning modes&lt;/span&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;827&quot; data-origin-height=&quot;151&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/73kWX/dJMcafTmJzH/VvfiXeVplCSJrqH3Rtx8lK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/73kWX/dJMcafTmJzH/VvfiXeVplCSJrqH3Rtx8lK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/73kWX/dJMcafTmJzH/VvfiXeVplCSJrqH3Rtx8lK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F73kWX%2FdJMcafTmJzH%2FVvfiXeVplCSJrqH3Rtx8lK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;827&quot; height=&quot;151&quot; data-origin-width=&quot;827&quot; data-origin-height=&quot;151&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Step 1) Collect candidates + remove easy data: uses ThinkMorph, Zebra-CoT, and MathCanvas-Instruct (datasets that contain intermediate thought images)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Step 2) Model-based labeling: uses the pass@8 metric with Qwen3-235B-Instruct as judge; only samples scoring 0.75 or higher are labeled&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Step 3) Add text-only CoT: adds 50,000 text-only CoT samples from OpenMMReasoner and other sources&lt;/span&gt;&lt;/p&gt;
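Step 2 amounts to a threshold filter over judged samples. A minimal sketch, where the `judgments` field and its 0/1 correctness format are assumptions for illustration (the paper's actual scoring pipeline may differ):

```python
def pass_at_8(judgments):
    # fraction of the 8 sampled answers the judge marked correct (1) or wrong (0)
    return sum(judgments) / len(judgments)

def filter_samples(samples, threshold=0.75):
    # keep only candidates whose judged score reaches the threshold (Step 2)
    return [s for s in samples if pass_at_8(s["judgments"]) >= threshold]
```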
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;3. Experiments&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;&lt;b&gt;Training Details&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;backbone: Qwen3-VL 8B&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Training: SFT (SwimBird-SFT-92K)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;GPU: A100-80GB&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;batch size: 128&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Only the LLM is updated (vision encoder and multimodal projector are frozen)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Scheduler/learning rate: cosine LR scheduler, initial LR = 1e-5&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;&lt;b&gt;Fine-grained Visual Understanding (high resolution)&lt;/b&gt;&lt;/span&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;833&quot; data-origin-height=&quot;604&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/eLFQt2/dJMcahDClNa/IchTz5U4O3el566tFjJFxK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/eLFQt2/dJMcahDClNa/IchTz5U4O3el566tFjJFxK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/eLFQt2/dJMcahDClNa/IchTz5U4O3el566tFjJFxK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FeLFQt2%2FdJMcahDClNa%2FIchTz5U4O3el566tFjJFxK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;833&quot; height=&quot;604&quot; data-origin-width=&quot;833&quot; data-origin-height=&quot;604&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;&lt;b&gt;General VQA and Multimodal Reasoning&lt;/b&gt;&lt;/span&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;832&quot; data-origin-height=&quot;325&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/biQWN4/dJMcabckEuA/KwMOKgH3B8LmkvNdiRSEak/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/biQWN4/dJMcabckEuA/KwMOKgH3B8LmkvNdiRSEak/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/biQWN4/dJMcabckEuA/KwMOKgH3B8LmkvNdiRSEak/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbiQWN4%2FdJMcabckEuA%2FKwMOKgH3B8LmkvNdiRSEak%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;832&quot; height=&quot;325&quot; data-origin-width=&quot;832&quot; data-origin-height=&quot;325&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot; data-token-index=&quot;0&quot;&gt;- Performance change by number of latent tokens and MSE weight&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;834&quot; data-origin-height=&quot;152&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/O35dW/dJMcagrbLXI/pQZggDn7MHcgrO51CIjle1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/O35dW/dJMcagrbLXI/pQZggDn7MHcgrO51CIjle1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/O35dW/dJMcagrbLXI/pQZggDn7MHcgrO51CIjle1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FO35dW%2FdJMcagrbLXI%2FpQZggDn7MHcgrO51CIjle1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;834&quot; height=&quot;152&quot; data-origin-width=&quot;834&quot; data-origin-height=&quot;152&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Table 4: performance by maximum token count&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Light'; color: #000000;&quot;&gt;Best performance at 32 &amp;rArr; shows that excessive latent computation can hinder overall reasoning&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;</description>
      <category>Paper Review/LLM &amp;amp; VLM</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/197</guid>
      <comments>https://ga02-ailab.tistory.com/197#entry197comment</comments>
      <pubDate>Thu, 12 Mar 2026 22:22:24 +0900</pubDate>
    </item>
    <item>
      <title>[OpenCV] C++ OpenCV Image Thresholding</title>
      <link>https://ga02-ailab.tistory.com/196</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;c++과 opencv를 이용해 이미지 각 픽셀값을 thresholding하는 방법입니다.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This chapter uses the image below as input.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;input_image.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/FjqxJ/dJMcadVa4jc/51odqWPaWUjwBW6bz47NMK/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/FjqxJ/dJMcadVa4jc/51odqWPaWUjwBW6bz47NMK/img.jpg&quot; data-alt=&quot;input image&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/FjqxJ/dJMcadVa4jc/51odqWPaWUjwBW6bz47NMK/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FFjqxJ%2FdJMcadVa4jc%2F51odqWPaWUjwBW6bz47NMK%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;380&quot; height=&quot;285&quot; data-filename=&quot;input_image.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;input image&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The prototype of the threshold function used throughout is as follows.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1770619990738&quot; class=&quot;cpp&quot; data-ke-language=&quot;cpp&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;double threshold(InputArray src, OutputArray dst, double thresh, double maxval, int type)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. Binary Thresholding&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;1287&quot; data-start=&quot;1261&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Pixel value &lt;b&gt;1~255&lt;/b&gt; &amp;rarr; white (255)&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;1308&quot; data-start=&quot;1288&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Pixel value &lt;b&gt;0&lt;/b&gt; &amp;rarr; black (0)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;pre id=&quot;code_1770619172590&quot; class=&quot;cpp&quot; data-ke-language=&quot;cpp&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
  // Read the image in grayscale
  Mat src = imread(&quot;threshold.png&quot;, IMREAD_GRAYSCALE);
  if (src.empty()) {
    cerr &amp;lt;&amp;lt; &quot;Could not read the input image&quot; &amp;lt;&amp;lt; endl;
    return 1;
  }
  Mat dst;

  // Set threshold and maxValue
  double thresh = 0;
  double maxValue = 255;

  // Binary threshold: pixels above thresh become maxValue, the rest 0
  threshold(src, dst, thresh, maxValue, THRESH_BINARY);
  imwrite(&quot;opencv-threshold-example.jpg&quot;, dst);
  return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;opencv-threshold-example.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/nbdfN/dJMcacPyEkn/zd08qi2TudmJGG1ZWNBcG1/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/nbdfN/dJMcacPyEkn/zd08qi2TudmJGG1ZWNBcG1/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/nbdfN/dJMcacPyEkn/zd08qi2TudmJGG1ZWNBcG1/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FnbdfN%2FdJMcacPyEkn%2Fzd08qi2TudmJGG1ZWNBcG1%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;392&quot; height=&quot;294&quot; data-filename=&quot;opencv-threshold-example.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;If we instead set the threshold to 127:&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;441&quot; data-start=&quot;427&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Pixel value 0 ~ 127 &amp;rarr; black&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;458&quot; data-start=&quot;442&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Pixel value 128 ~ 255 &amp;rarr; white&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;opencv-threshold-example.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/xqwGe/dJMcaiWuSx3/IPzVH9aZE1uYzKgfIKkpo1/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/xqwGe/dJMcaiWuSx3/IPzVH9aZE1uYzKgfIKkpo1/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/xqwGe/dJMcaiWuSx3/IPzVH9aZE1uYzKgfIKkpo1/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FxqwGe%2FdJMcaiWuSx3%2FIPzVH9aZE1uYzKgfIKkpo1%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;378&quot; height=&quot;284&quot; data-filename=&quot;opencv-threshold-example.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Inverse-Binary Thresholding (THRESH_BINARY_INV)&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Pixel value &amp;gt; 127 &amp;rarr; 0 (black)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Pixel value &amp;lt;= 127 &amp;rarr; 255 (white)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1770620546156&quot; class=&quot;cpp&quot; data-ke-language=&quot;cpp&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;// Set threshold and maxValue
double thresh = 127;
double maxValue = 255;

// Inverse binary threshold: pixels above thresh become 0, the rest maxValue
threshold(src, dst, thresh, maxValue, THRESH_BINARY_INV);&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;opencv-thresh-binary-inv.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/BSJqO/dJMcadnlCcy/KnLTKxUBRLKlnrE9nyCCL0/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/BSJqO/dJMcadnlCcy/KnLTKxUBRLKlnrE9nyCCL0/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/BSJqO/dJMcadnlCcy/KnLTKxUBRLKlnrE9nyCCL0/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FBSJqO%2FdJMcadnlCcy%2FKnLTKxUBRLKlnrE9nyCCL0%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;404&quot; height=&quot;303&quot; data-filename=&quot;opencv-thresh-binary-inv.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3. Truncate Thresholding (THRESH_TRUNC)&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;: Values above the threshold are clipped (truncated) to thresh,&amp;nbsp; &lt;/b&gt;while values at or below it are kept unchanged&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Pixel value &amp;gt; 127 &amp;rarr; 127&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Pixel value &amp;lt;= 127 &amp;rarr; original value kept&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1770620723281&quot; class=&quot;cpp&quot; data-ke-language=&quot;cpp&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;// Set threshold and maxValue
double thresh = 127;
double maxValue = 255;

// Truncate: pixels above thresh are clipped to thresh
threshold(src, dst, thresh, maxValue, THRESH_TRUNC);&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;opencv-thresh-trunc.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dzdRin/dJMcai91nsz/FXnHU4yHl996sCXI8BcOg1/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dzdRin/dJMcai91nsz/FXnHU4yHl996sCXI8BcOg1/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dzdRin/dJMcai91nsz/FXnHU4yHl996sCXI8BcOg1/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdzdRin%2FdJMcai91nsz%2FFXnHU4yHl996sCXI8BcOg1%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;413&quot; height=&quot;310&quot; data-filename=&quot;opencv-thresh-trunc.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;4. Threshold to Zero (THRESH_TOZERO)&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;: &lt;b&gt;Values at or below the threshold are zeroed out,&lt;/b&gt; while only values above it are kept&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Pixel value &amp;gt; 127 &amp;rarr; original value kept&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Pixel value &amp;lt;= 127 &amp;rarr; 0 (black)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1770620916226&quot; class=&quot;cpp&quot; data-ke-language=&quot;cpp&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;// Set threshold and maxValue
double thresh = 127;
double maxValue = 255;

// Threshold to zero: pixels at or below thresh become 0
threshold(src, dst, thresh, maxValue, THRESH_TOZERO);&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;opencv-thresh-tozero.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bXZ70w/dJMcahwzAUU/DX4uk0QTb1EMY9yLKkwFH0/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bXZ70w/dJMcahwzAUU/DX4uk0QTb1EMY9yLKkwFH0/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bXZ70w/dJMcahwzAUU/DX4uk0QTb1EMY9yLKkwFH0/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbXZ70w%2FdJMcahwzAUU%2FDX4uk0QTb1EMY9yLKkwFH0%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;365&quot; height=&quot;274&quot; data-filename=&quot;opencv-thresh-tozero.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;5. Inverted Threshold to Zero (THRESH_TOZERO_INV)&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;: &lt;b&gt;Values above the threshold are removed (set to 0),&lt;/b&gt; while only values at or below it are kept&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Pixel value &amp;gt; 127 &amp;rarr; 0&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Pixel value &amp;lt;= 127 &amp;rarr; original value kept&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre id=&quot;code_1770621095564&quot; class=&quot;cpp&quot; data-ke-language=&quot;cpp&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;// Set threshold and maxValue
double thresh = 127;
double maxValue = 255;

// Inverted threshold to zero: pixels above thresh become 0
threshold(src, dst, thresh, maxValue, THRESH_TOZERO_INV);&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;opencv-thresh-tozero.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bVo8Qw/dJMcabQCSrS/6rkkcL1A4baQ8X34NYna5K/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bVo8Qw/dJMcabQCSrS/6rkkcL1A4baQ8X34NYna5K/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bVo8Qw/dJMcabQCSrS/6rkkcL1A4baQ8X34NYna5K/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbVo8Qw%2FdJMcabQCSrS%2F6rkkcL1A4baQ8X34NYna5K%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;387&quot; height=&quot;290&quot; data-filename=&quot;opencv-thresh-tozero.jpg&quot; data-origin-width=&quot;512&quot; data-origin-height=&quot;384&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
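&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The five modes above reduce to simple per-pixel rules. As a rough sketch, here they are reproduced in plain NumPy (the helper name apply_threshold is ours, for illustration; it mimics the semantics of cv2.threshold rather than calling OpenCV itself):&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

def apply_threshold(src, thresh, max_value, mode):
    # Per-pixel rules of cv2.threshold, reproduced with NumPy
    above = np.greater(src, thresh)  # pixels strictly above the threshold
    if mode == 'binary':        # above becomes max_value, the rest 0
        return np.where(above, max_value, 0)
    if mode == 'binary_inv':    # above becomes 0, the rest max_value
        return np.where(above, 0, max_value)
    if mode == 'trunc':         # above is clipped to thresh, the rest kept
        return np.where(above, thresh, src)
    if mode == 'tozero':        # above is kept, the rest zeroed
        return np.where(above, src, 0)
    if mode == 'tozero_inv':    # above is zeroed, the rest kept
        return np.where(above, 0, src)
    raise ValueError(mode)

px = np.array([0, 50, 127, 128, 255])
print(apply_threshold(px, 127, 255, 'binary'))  # 0, 50, 127 go black; 128, 255 go white
print(apply_threshold(px, 127, 255, 'trunc'))   # 128 and 255 are clipped to 127
print(apply_threshold(px, 127, 255, 'tozero'))  # only 128 and 255 survive&lt;/code&gt;&lt;/pre&gt;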
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;[References]&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;a href=&quot;https://learnopencv.com/opencv-threshold-python-cpp/&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://learnopencv.com/opencv-threshold-python-cpp/&lt;/a&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>OpenCV</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/196</guid>
      <comments>https://ga02-ailab.tistory.com/196#entry196comment</comments>
      <pubDate>Mon, 9 Feb 2026 16:15:27 +0900</pubDate>
    </item>
    <item>
      <title>[Pytorch] VLM 모델 양자화 (FP8, GPTQ, AWQ)</title>
      <link>https://ga02-ailab.tistory.com/195</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&amp;nbsp;Qwen3_VL 모델을 FP8, GPTQ, AWQ 3가지 방법으로 양자화하는 방법에 대한 설명입니다. FP8은 8bit 양자화, GPTQ와 AWQ는 4bit 양자화라는 차이점도 있고 또한 &quot;언제, 무엇을, 어떻게 양자화하느냐&quot;에 따라서 목적&amp;middot;효과&amp;middot;제약이 꽤 다릅니다.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 style=&quot;color: #000000;&quot; data-ke-size=&quot;size20&quot; data-path-to-node=&quot;2&quot;&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;1. FP8 (8-bit Floating Point)&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;p data-path-to-node=&quot;3&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;FP8 is not so much a specific algorithm as a&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b data-path-to-node=&quot;3&quot; data-index-in-node=&quot;74&quot;&gt;data format&lt;/b&gt; supported by the latest GPUs such as the NVIDIA H100 (Hopper) and B200 (Blackwell)&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot; data-path-to-node=&quot;4&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b data-path-to-node=&quot;4,0,0&quot; data-index-in-node=&quot;0&quot;&gt;Characteristics:&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;It uses floating point rather than integers (INT8), so it has a wider dynamic range&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b data-path-to-node=&quot;4,1,0&quot; data-index-in-node=&quot;0&quot;&gt;Advantages:&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;It runs natively on the Tensor Cores of recent GPUs, so it is very fast.&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;A major plus is that not only the weights but also the activations and the KV cache can all be processed in FP8&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b data-path-to-node=&quot;4,2,0&quot; data-index-in-node=&quot;0&quot;&gt;Limitations:&lt;/b&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;It requires recent hardware, and it does not shrink the model as dramatically as 4-bit quantization (GPTQ, AWQ) does (typically about a 50% saving vs. FP16).&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
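&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;The wider-dynamic-range claim can be checked by hand. Below is a small sketch (the helper name e4m3_values is ours) that enumerates every finite FP8 E4M3 magnitude, assuming the common e4m3fn variant: 4 exponent bits with bias 7, 3 mantissa bits, and the all-ones exponent reused for normal values except the single NaN encoding:&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;def e4m3_values():
    # All finite FP8 E4M3 (e4m3fn) magnitudes: 4 exponent bits (bias 7),
    # 3 mantissa bits. exp == 0 encodes subnormals, and the single pattern
    # exp == 15, mantissa == 7 is NaN, so the largest normal value is
    # (1 + 6/8) * 2**8 = 448.
    vals = set()
    for exp in range(16):
        for man in range(8):
            if exp == 15 and man == 7:
                continue  # the NaN encoding in the e4m3fn variant
            if exp == 0:
                v = (man / 8.0) * 2.0 ** (-6)  # subnormal
            else:
                v = (1.0 + man / 8.0) * 2.0 ** (exp - 7)
            vals.add(v)
    return sorted(vals)

vals = e4m3_values()
print('distinct magnitudes:', len(vals))  # 127, including zero
print('largest value:', vals[-1])         # 448.0, vs. an INT8 maximum of 127&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Because the representable values are spaced logarithmically, FP8 tolerates activation outliers that would saturate a uniformly spaced INT8 grid, which is part of why activations and the KV cache can stay in FP8.&lt;/span&gt;&lt;/p&gt;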
&lt;pre id=&quot;code_1768804769830&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import os
import torch
from datasets import load_dataset
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier


model_path = &quot;path to the model to quantize&quot;

print(model_path)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    attn_implementation=&quot;flash_attention_2&quot;,
    device_map=&quot;auto&quot;,
)

processor = AutoProcessor.from_pretrained(&quot;Qwen/Qwen3-VL-8B-Instruct&quot;)

DATASET_ID = &quot;neuralmagic/calibration&quot;
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 8192

ds = load_dataset(DATASET_ID, name=&quot;LLM&quot;, split=f&quot;train[:{NUM_CALIBRATION_SAMPLES}]&quot;)
ds = ds.shuffle(seed=42)


def preprocess_function(example):
    messages = []
    for message in example[&quot;messages&quot;]:
        messages.append(
            {
                &quot;role&quot;: message[&quot;role&quot;],
                &quot;content&quot;: [{&quot;type&quot;: &quot;text&quot;, &quot;text&quot;: message[&quot;content&quot;]}],
            }
        )

    return processor.apply_chat_template(
        messages,
        return_tensors=&quot;pt&quot;,
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        tokenize=True,
        add_special_tokens=False,
        return_dict=True,
        add_generation_prompt=False,
    )


ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)


def data_collator(batch):
    assert len(batch) == 1
    return {
        key: (
            torch.tensor(value)
            if key != &quot;pixel_values&quot;
            else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
        )
        for key, value in batch[0].items()
    }


# Configure FP8 dynamic quantization: weights are quantized offline and
# activation scales are computed on the fly at inference time, while the
# vision tower, lm_head, and MoE gates stay in higher precision.

recipe = QuantizationModifier(
    targets=&quot;Linear&quot;,
    scheme=&quot;FP8_DYNAMIC&quot;,
    ignore=[
        &quot;re:.*lm_head&quot;,
        &quot;re:visual.*&quot;,
        &quot;re:model.visual.*&quot;,
        &quot;re:.*mlp.gate$&quot;,
    ],
)
# Apply quantization (FP8_DYNAMIC needs no calibration pass, so the
# dataset prepared above is not passed to oneshot here).
oneshot(model=model, recipe=recipe)

# Save to disk in compressed-tensors format.
SAVE_DIR = &quot;qwen3_vl_8b_FP8&quot;
model.save_pretrained(SAVE_DIR)
processor.save_pretrained(SAVE_DIR)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 data-path-to-node=&quot;5&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;2. GPTQ (Generative Pre-trained Transformer Quantization)&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;p data-path-to-node=&quot;6&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;GPTQ is a &lt;b data-index-in-node=&quot;6&quot; data-path-to-node=&quot;6&quot;&gt;weight-only&lt;/b&gt; quantization method and one of the most widely used ways to compress model weights to 4-bit&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-path-to-node=&quot;7&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b data-index-in-node=&quot;0&quot; data-path-to-node=&quot;7,0,0&quot;&gt;Characteristics:&lt;/b&gt; For each layer, it minimizes the error introduced by quantizing the weights through a mathematical optimization (based on the Hessian matrix).&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b data-index-in-node=&quot;0&quot; data-path-to-node=&quot;7,1,0&quot;&gt;Advantages:&lt;/b&gt; &lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b data-index-in-node=&quot;6&quot; data-path-to-node=&quot;7,1,0&quot;&gt;Compression:&lt;/b&gt; It cuts the model size to roughly a quarter, so large models can run even on low-end GPUs.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b data-index-in-node=&quot;0&quot; data-path-to-node=&quot;7,1,1,0,0&quot;&gt;Precision:&lt;/b&gt; It compensates for correlations between weights, so the accuracy drop is small even at 4-bit.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b data-index-in-node=&quot;0&quot; data-path-to-node=&quot;7,2,0&quot;&gt;Limitations:&lt;/b&gt; It needs calibration data during quantization, and the process takes a fair amount of compute time.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
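&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;The quantize-and-compensate idea behind GPTQ can be illustrated without the full Hessian machinery. Here is a toy sketch (all names ours) for a single output neuron: quantize one weight at a time and fold its rounding error into the weights not yet quantized, so the layer output on calibration data is preserved. Real GPTQ weights this correction by the inverse Hessian of the activations; a plain least-squares fit stands in here:&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

def quantize_with_compensation(w, x, step=0.25):
    # Greedy GPTQ-style loop: quantize w[i] to the grid, then adjust the
    # not-yet-quantized weights so x @ w is preserved as well as possible
    w = w.astype(float).copy()
    q = np.zeros_like(w)
    n = len(w)
    for i in range(n):
        q[i] = np.round(w[i] / step) * step
        err = w[i] - q[i]
        rest = np.arange(i + 1, n)
        if rest.size:
            # least-squares stand-in for the inverse-Hessian update
            corr, *_ = np.linalg.lstsq(x[:, rest], x[:, i] * err, rcond=None)
            w[rest] = w[rest] + corr
    return q

# Perfectly correlated input channels, so the error can be fully shifted
x = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
w = np.array([0.1, 0.1])
q_naive = np.round(w / 0.25) * 0.25            # plain round-to-nearest
q_comp = quantize_with_compensation(w, x)
print(np.linalg.norm(x @ w - x @ q_naive))     # about 0.748
print(np.linalg.norm(x @ w - x @ q_comp))      # about 0.187&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Round-to-nearest zeroes both small weights, while the compensated loop pushes the first weight's error onto the second, which then rounds up to the next grid point and recovers most of the output.&lt;/span&gt;&lt;/p&gt;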
&lt;pre id=&quot;code_1768804933185&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import os
import base64
from io import BytesIO
import torch
from datasets import load_dataset
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.utils import dispatch_for_generation


model_path = &quot;path to the model to quantize&quot;

print(model_path)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    device_map=&quot;cpu&quot;,
    torch_dtype=&quot;auto&quot;,
    local_files_only=True,
)
model.eval()
processor = AutoProcessor.from_pretrained(&quot;Qwen/Qwen3-VL-8B-Instruct&quot;, trust_remote_code=True)

# Oneshot arguments
DATASET_ID = &quot;flickr30k&quot;
DATASET_SPLIT = &quot;test[:512]&quot;
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 8192

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42)


# Apply chat template and tokenize inputs.
def preprocess_and_tokenize(example):
    # preprocess
    buffered = BytesIO()
    example[&quot;image&quot;].save(buffered, format=&quot;PNG&quot;)
    encoded_image = base64.b64encode(buffered.getvalue())
    encoded_image_text = encoded_image.decode(&quot;utf-8&quot;)
    base64_qwen = f&quot;data:image;base64,{encoded_image_text}&quot;
    messages = [
        {
            &quot;role&quot;: &quot;user&quot;,
            &quot;content&quot;: [
                {&quot;type&quot;: &quot;image&quot;, &quot;image&quot;: base64_qwen},
                {&quot;type&quot;: &quot;text&quot;, &quot;text&quot;: &quot;What does the image show?&quot;},
            ],
        }
    ]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)

    # tokenize
    return processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
    )

ds = ds.map(preprocess_and_tokenize, remove_columns=ds.column_names)

# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}


# Recipe

recipe = [
    GPTQModifier(
        targets=&quot;Linear&quot;,
        scheme=&quot;W4A16&quot;,
        ignore=[
            &quot;re:.*lm_head&quot;,
            &quot;re:visual.*&quot;,
            &quot;re:model.visual.*&quot;,
            &quot;re:.*mlp.gate$&quot;
        ],
    ),
]
# Perform oneshot
oneshot(
    model=model,
    tokenizer=&quot;Qwen/Qwen3-VL-8B-Instruct&quot;,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    data_collator=data_collator,
    sequential_targets=[&quot;Qwen3VLTextDecoderLayer&quot;],
)

# Confirm generations of the quantized model look sane.
print(&quot;========== SAMPLE GENERATION ==============&quot;)
dispatch_for_generation(model)
messages = [
    {
        &quot;role&quot;: &quot;user&quot;,
        &quot;content&quot;: [
            {
                &quot;type&quot;: &quot;image&quot;,
                &quot;image&quot;: &quot;http://images.cocodataset.org/train2017/000000231895.jpg&quot;,
            },
            {&quot;type&quot;: &quot;text&quot;, &quot;text&quot;: &quot;Please describe the animal in this image\n&quot;},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[prompt],
    images=image_inputs,
    videos=video_inputs,
    padding=False,
    max_length=MAX_SEQUENCE_LENGTH,
    truncation=True,
    return_tensors=&quot;pt&quot;,
).to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
print(&quot;==========================================&quot;)


# Save to disk compressed.
SAVE_DIR = &quot;qwen3_vl_8b_gptq&quot;
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h4 data-path-to-node=&quot;8&quot; data-ke-size=&quot;size20&quot;&gt;&amp;nbsp;&lt;/h4&gt;
&lt;h4 data-path-to-node=&quot;8&quot; data-ke-size=&quot;size20&quot;&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;3. AWQ (Activation-aware Weight Quantization)&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/h4&gt;
&lt;p data-path-to-node=&quot;9&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;AWQ starts from the observation that a small subset of weights plays a decisive role in model quality.&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-path-to-node=&quot;10&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b data-index-in-node=&quot;0&quot; data-path-to-node=&quot;10,0,0&quot;&gt;Characteristics:&lt;/b&gt; Instead of treating every weight equally, it protects the weights on channels whose activations are large at inference time, minimizing quantization error.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b data-index-in-node=&quot;0&quot; data-path-to-node=&quot;10,1,0&quot;&gt;Advantages:&lt;/b&gt; It only adjusts scaling factors, with no complex optimization procedure, so it is less likely to overfit a particular calibration set&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-path-to-node=&quot;10,1,1&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b data-index-in-node=&quot;0&quot; data-path-to-node=&quot;10,1,1,0,0&quot;&gt;Speed:&lt;/b&gt; The quantization process itself is faster than GPTQ, and it performs well on recent vLLM.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b data-index-in-node=&quot;0&quot; data-path-to-node=&quot;10,2,0&quot;&gt;Limitations:&lt;/b&gt; Like GPTQ, it focuses mainly on compressing the weights only.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
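&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;The protect-the-salient-channels trick is easy to sketch: multiply an important input channel's weights by a factor s before rounding and divide its activations by s (a mathematical no-op before rounding), so that channel effectively gets finer quantization steps. A toy NumPy version (all names ours; s is hand-picked, whereas real AWQ searches per-group scales from calibration statistics):&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

def fake_quant(w, n_bits=4):
    # Symmetric per-tensor round-to-nearest quantization of the weights
    qmax = 2.0 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

# Channel 0 sees much larger activations than channel 1: a salient channel
x = np.array([[10.0, 1.0], [-10.0, 1.0], [10.0, -1.0]])
w = np.array([0.06, 0.70])

# Plain RTN: channel 0's rounding error is amplified by its large inputs
err_plain = np.linalg.norm(x @ w - x @ fake_quant(w))

# AWQ-style: scale channel 0's weight up before rounding and its activations
# down by the same factor, giving that channel a finer effective grid
s = np.array([5.0, 1.0])
err_awq = np.linalg.norm(x @ w - (x / s) @ fake_quant(w * s))

print(f'plain RTN output error: {err_plain:.4f}')  # about 0.69
print(f'AWQ-style output error: {err_awq:.4f}')    # essentially zero&lt;/code&gt;&lt;/pre&gt;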
&lt;pre id=&quot;code_1768805064930&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import os
import torch
from datasets import load_dataset
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.utils import dispatch_for_generation


model_path = &quot;path to the model to quantize&quot;

print(model_path)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype=torch.bfloat16,
    attn_implementation=&quot;flash_attention_2&quot;,
    device_map=&quot;auto&quot;,
)
model.eval()
processor = AutoProcessor.from_pretrained(&quot;Qwen/Qwen3-VL-8B-Instruct&quot;)

DATASET_ID = &quot;neuralmagic/calibration&quot;
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 8192

ds = load_dataset(DATASET_ID, name=&quot;LLM&quot;, split=f&quot;train[:{NUM_CALIBRATION_SAMPLES}]&quot;)
ds = ds.shuffle(seed=42)


def preprocess_function(example):
    messages = []
    for message in example[&quot;messages&quot;]:
        messages.append(
            {
                &quot;role&quot;: message[&quot;role&quot;],
                &quot;content&quot;: [{&quot;type&quot;: &quot;text&quot;, &quot;text&quot;: message[&quot;content&quot;]}],
            }
        )

    return processor.apply_chat_template(
        messages,
        return_tensors=&quot;pt&quot;,
        padding=False,
        truncation=True,
        max_length=MAX_SEQUENCE_LENGTH,
        tokenize=True,
        add_special_tokens=False,
        return_dict=True,
        add_generation_prompt=False,
    )


ds = ds.map(preprocess_function, batched=False, remove_columns=ds.column_names)


def data_collator(batch):
    assert len(batch) == 1
    return {
        key: (
            torch.tensor(value)
            if key != &quot;pixel_values&quot;
            else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
        )
        for key, value in batch[0].items()
    }


# Configure AWQ quantization with smoothing and balancing.
# NOTE: As written this recipe uses W8A16 (num_bits=8) with group_size=32;
# set num_bits=4 below for the usual W4A16 AWQ.
recipe = AWQModifier(
    ignore=[
        &quot;re:.*embed_tokens&quot;,
        &quot;re:.*input_layernorm$&quot;,
        &quot;re:.*mlp[.]gate$&quot;,
        &quot;re:.*post_attention_layernorm$&quot;,
        &quot;re:.*norm$&quot;,
        &quot;re:model[.]visual.*&quot;,
        &quot;re:visual.*&quot;,
        &quot;lm_head&quot;,
    ],
    duo_scaling=True,
    config_groups={
        &quot;group_0&quot;: {
            &quot;targets&quot;: [&quot;Linear&quot;],
            &quot;weights&quot;: {
                &quot;num_bits&quot;: 8,  # set to 4 for W4A16
                &quot;type&quot;: &quot;int&quot;,
                &quot;symmetric&quot;: True,
                &quot;group_size&quot;: 32,
                &quot;strategy&quot;: &quot;group&quot;,
                &quot;dynamic&quot;: False,
                &quot;actorder&quot;: None,
                &quot;observer&quot;: &quot;mse&quot;,
            },
        }
    },
)

# Apply AWQ quantization.
oneshot(
    model=model,
    processor=processor,
    recipe=recipe,
    dataset=ds,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    data_collator=data_collator,
)

print(&quot;========== SAMPLE GENERATION ==============&quot;)
dispatch_for_generation(model)
input_ids = processor(text=&quot;Hello my name is&quot;, return_tensors=&quot;pt&quot;).input_ids.to(&quot;cuda&quot;)
output = model.generate(input_ids, max_new_tokens=20)
print(processor.decode(output[0]))
print(&quot;==========================================&quot;)

# Save to disk in compressed-tensors format.
SAVE_DIR = &quot;./qwen3_vl_8b_awq8&quot;
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)&lt;/code&gt;&lt;/pre&gt;
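The recipe above assigns one symmetric scale per group of 32 weight values. As a minimal numpy sketch of that idea — independent of llm-compressor's actual implementation, with `quantize_groupwise` a hypothetical helper:

```python
import numpy as np

def quantize_groupwise(w, num_bits=8, group_size=32):
    """Symmetric group-wise integer quantization: one scale per group."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8-bit
    w = w.reshape(-1, group_size)             # one row per group
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                          # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)
w_hat = quantize_groupwise(w, num_bits=8, group_size=32).reshape(-1)
print(float(np.abs(w - w_hat).max()))  # worst-case rounding error
```

With 8 bits the per-group rounding error is bounded by half a quantization step, which is why W8A16 loses so little accuracy relative to W4A16.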
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Pytorch</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/195</guid>
      <comments>https://ga02-ailab.tistory.com/195#entry195comment</comments>
      <pubDate>Mon, 19 Jan 2026 17:04:01 +0900</pubDate>
    </item>
    <item>
      <title>[3] Context Cascade Compression: Exploring the Upper Limits of Text Compression</title>
      <link>https://ga02-ailab.tistory.com/194</link>
      <description>&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[paper] &lt;span style=&quot;color: #006dd7;&quot;&gt;&lt;a style=&quot;color: #006dd7;&quot; href=&quot;https://arxiv.org/pdf/2511.15244&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/pdf/2511.15244&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Github] &lt;span style=&quot;color: #006dd7;&quot;&gt;&lt;a style=&quot;color: #006dd7;&quot; href=&quot;https://github.com/liufanfanlff/C3-Context-Cascade-Compression&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://github.com/liufanfanlff/C3-Context-Cascade-Compression&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1767192344635&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;GitHub - liufanfanlff/C3-Context-Cascade-Compression: Official code implementation of Context Cascade Compression: Exploring the&quot; data-og-description=&quot;Official code implementation of Context Cascade Compression: Exploring the Upper Limits of Text Compression - liufanfanlff/C3-Context-Cascade-Compression&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/liufanfanlff/C3-Context-Cascade-Compression&quot; data-og-url=&quot;https://github.com/liufanfanlff/C3-Context-Cascade-Compression&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/byzxjY/hyZPPY3JFO/q41JgICALfak4GN56KH2H1/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600,https://scrap.kakaocdn.net/dn/ctK4Ae/hyZQZ6Yp5i/gHIB7GWmVYunYHk6Fj04LK/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600&quot;&gt;&lt;a href=&quot;https://github.com/liufanfanlff/C3-Context-Cascade-Compression&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/liufanfanlff/C3-Context-Cascade-Compression&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/byzxjY/hyZPPY3JFO/q41JgICALfak4GN56KH2H1/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600,https://scrap.kakaocdn.net/dn/ctK4Ae/hyZQZ6Yp5i/gHIB7GWmVYunYHk6Fj04LK/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GitHub - liufanfanlff/C3-Context-Cascade-Compression: Official code implementation of Context Cascade Compression: Exploring the&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Official code implementation of Context Cascade Compression: Exploring the Upper Limits of Text Compression - liufanfanlff/C3-Context-Cascade-Compression&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #ffc1c8;&quot;&gt;Abstract&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;LLMs are limited in how much context they can process at once &amp;rarr; active research on compressing long contexts (e.g., DeepSeek-OCR)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Proposes Context Cascade Compression (C3), which explores efficient methods of text compression&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #ffc1c8;&quot;&gt;Introduction&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;LLMs are limited in the context length they can process at once: compute and memory costs&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Existing approaches to this problem&lt;/span&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1850&quot; data-origin-height=&quot;890&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/brDhS6/dJMcac9x0r4/GCCIonnJEo0Q4s2WMnk5L1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/brDhS6/dJMcac9x0r4/GCCIonnJEo0Q4s2WMnk5L1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/brDhS6/dJMcac9x0r4/GCCIonnJEo0Q4s2WMnk5L1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbrDhS6%2FdJMcac9x0r4%2FGCCIonnJEo0Q4s2WMnk5L1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;736&quot; height=&quot;354&quot; data-origin-width=&quot;1850&quot; data-origin-height=&quot;890&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Attempts to improve attention &amp;rarr; still computationally heavy, since they do not reduce the token count itself&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Retrieval-based input reduction &amp;rarr; incurs information loss plus extra latency&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;DeepSeek-OCR's approach&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;OCR-based compression: render the text as an image, then restore it via OCR&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Renders the full text as a picture and feeds it to a ViT &amp;rarr; insight that an image can represent the information in far fewer tokens than the text itself&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In practice achieves 97% accuracy at 10x compression&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;However, it requires a complex text&amp;rarr;image&amp;rarr;vision-token&amp;rarr;LLM pipeline, and performance drops as resolution is reduced&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Hence, this paper proposes &lt;b&gt;Context Cascade Compression (C3)&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Idea: directly cascade two LLM stages to compress and restore text&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Stage 1: a small LLM converts the text into n latent tokens&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Stage 2: a large LLM performs the downstream task (e.g., reconstructing the original text)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;rArr; A purely text-based pipeline: no need to worry about losing visual information such as document layout or color, or about degradation from reduced image quality&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #ffc1c8;&quot;&gt;Method&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1868&quot; data-origin-height=&quot;948&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bHYmaO/dJMcaaqpaIz/PLkRFrNvs1rUBrknSvz15k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bHYmaO/dJMcaaqpaIz/PLkRFrNvs1rUBrknSvz15k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bHYmaO/dJMcaaqpaIz/PLkRFrNvs1rUBrknSvz15k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbHYmaO%2FdJMcaaqpaIz%2FPLkRFrNvs1rUBrknSvz15k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1868&quot; height=&quot;948&quot; data-origin-width=&quot;1868&quot; data-origin-height=&quot;948&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;3.1 Architecture&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A structure combining a context-compression encoder LLM and a decoder LLM&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Context-compression encoder LLM &amp;rArr; compresses information (text tokens &amp;rarr; latent tokens)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Decoder LLM &amp;rArr; generates the output from the latent tokens and the prompt&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;3.2 Context Compression Encoder LLM&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Uses a fine-tuned Qwen2.5 1.5B&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Uses learnable context queries&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Appends N embedding vectors to the original text token sequence &amp;rarr; self-attention &amp;rarr; afterwards only those N embedding vectors are kept&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Advantage: compresses the context using only the standard Transformer encoding pass, with no architectural changes or additional attention mechanisms&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;3.3 Decoder LLM&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Uses Qwen2.5 3B fine-tuned for text reconstruction&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This paper focuses only on the task of reconstructing the original text&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Prompt used for reconstruction: &amp;ldquo;repeat the text:&amp;rdquo;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
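The learnable context-query mechanism in 3.2 can be sketched as a toy single-head self-attention in numpy. This is only an illustration under simplified assumptions — `compress`, the dimensions, and the single attention layer are all hypothetical; the paper's actual encoder is a fine-tuned Qwen2.5 1.5B:

```python
import numpy as np

def compress(text_emb, queries):
    """Append N context-query vectors to the token sequence, run one
    self-attention pass, and keep only the N query positions as latents."""
    x = np.concatenate([text_emb, queries], axis=0)   # (T + N, d)
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                     # full self-attention
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # row-wise softmax
    out = attn @ x
    return out[-queries.shape[0]:]                    # last N positions only

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(100, 16))   # 100 text tokens, dim 16
queries = rng.normal(size=(8, 16))      # N = 8 learnable context queries
latent = compress(text_emb, queries)
print(latent.shape)                     # 100 tokens compressed into 8 latents
```

The key point is that no extra module is needed: the queries attend to the text through the encoder's ordinary attention, and only their output states are passed to the decoder LLM.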
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #ffc1c8;&quot;&gt;Experiments&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;4.1 Data&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1M pages of English + Chinese document OCR data&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Minimal preprocessing; texts of varying lengths, from long documents to short passages, are simply concatenated for training&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;4.2 Training Setup&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;8 NVIDIA H800 GPUs&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;40,000 steps&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;optimizer: AdamW&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;4.3 Context Compression Study&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;4.3.1 Quantitative&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Uses the English-document portion of the Fox benchmark (text lengths of 600&amp;ndash;1300 tokens)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Measures how accurately the original document is reconstructed, compared against DeepSeek-OCR&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Performance comparison by compression ratio&lt;/span&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1800&quot; data-origin-height=&quot;550&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bcSXs6/dJMcad1Gb39/pTWS9EZVm3cmy5LMkV65L1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bcSXs6/dJMcad1Gb39/pTWS9EZVm3cmy5LMkV65L1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bcSXs6/dJMcad1Gb39/pTWS9EZVm3cmy5LMkV65L1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbcSXs6%2FdJMcad1Gb39%2FpTWS9EZVm3cmy5LMkV65L1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1800&quot; height=&quot;550&quot; data-origin-width=&quot;1800&quot; data-origin-height=&quot;550&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Outperforms DeepSeek-OCR&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;DeepSeek-OCR suffers a sharp performance drop at higher compression ratios due to reduced resolution&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;C3 compresses meaning directly in the text domain from the start, so performance degrades only mildly as the compression ratio increases&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Moreover, the small LLM leverages knowledge acquired during pretraining to extract key information, producing context vectors with little information loss&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Extreme compression (latent tokens = 32)&lt;/span&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;872&quot; data-origin-height=&quot;550&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bpSr9F/dJMcagKVSYf/e3Rq1MzsjrIMWVHULgtQqK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bpSr9F/dJMcagKVSYf/e3Rq1MzsjrIMWVHULgtQqK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bpSr9F/dJMcagKVSYf/e3Rq1MzsjrIMWVHULgtQqK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbpSr9F%2FdJMcagKVSYf%2Fe3Rq1MzsjrIMWVHULgtQqK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;392&quot; height=&quot;247&quot; data-origin-width=&quot;872&quot; data-origin-height=&quot;550&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Still achieves high accuracy, in the 90% range&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;4.3.2 Qualitative&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;C3 restores the original almost perfectly even when the content has no semantic logic (text mixed with meaningless random strings, or unstructured text with scrambled sentence structure)&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;It has learned to compress and decompress textual patterns themselves&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Error cases&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Errors concentrate mainly at the end of the text: only part of the tail is forgotten&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;When the compression token budget is insufficient, information is lost sequentially &amp;rarr; too few latent tokens to hold the entire text, so the tail is dropped&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In contrast, DeepSeek-OCR's errors occur throughout the entire text&lt;/span&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2016&quot; data-origin-height=&quot;504&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/uF0U8/dJMcaaRtL12/idAUqQgMS5uVyIMORwvZ11/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/uF0U8/dJMcaaRtL12/idAUqQgMS5uVyIMORwvZ11/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/uF0U8/dJMcaaRtL12/idAUqQgMS5uVyIMORwvZ11/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FuF0U8%2FdJMcaaRtL12%2FidAUqQgMS5uVyIMORwvZ11%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2016&quot; height=&quot;504&quot; data-origin-width=&quot;2016&quot; data-origin-height=&quot;504&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #ffc1c8;&quot;&gt;Limitation&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Errors concentrate toward the end of the text &amp;rarr; similar to human memory&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The compression process needs further improvement&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
</description>
      <category>Paper Review/LLM &amp;amp; VLM</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/194</guid>
      <comments>https://ga02-ailab.tistory.com/194#entry194comment</comments>
      <pubDate>Wed, 31 Dec 2025 23:50:25 +0900</pubDate>
    </item>
    <item>
      <title>[OpenCV] C++ OpenCV / Image Filtering Using Convolution</title>
      <link>https://ga02-ailab.tistory.com/193</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;C++과 OpenCV를 이용해 이미지에 다양한 필터를 적용하고 블러처리하는 방법을 알아보겠습니다.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. Identity Kernel &amp;amp; Custom 2D-Convolution Kernel Blurring&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762764591745&quot; class=&quot;cpp&quot; data-ke-language=&quot;cpp&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image = imread(&quot;../filtered.jpg&quot;);

    // Apply the identity kernel (output equals input)
    Mat kernel1 = (Mat_&amp;lt;double&amp;gt;(3,3) &amp;lt;&amp;lt; 0,0,0,0,1,0,0,0,0);
    Mat identity;
    filter2D(image, identity, -1, kernel1, Point(-1,-1), 0, BORDER_DEFAULT);
    imwrite(&quot;identity.jpg&quot;, identity);


    // Blur the image using a custom 2D-convolution kernel
    Mat kernel2 = Mat::ones(5, 5, CV_64F);
    kernel2 = kernel2 / 25;  // normalized 5x5 box filter
    Mat img;
    filter2D(image, img, -1, kernel2, Point(-1, -1), 0, BORDER_DEFAULT);
    imwrite(&quot;blur.jpg&quot;, img);

    cout &amp;lt;&amp;lt; &quot;finish!!&quot;&amp;lt;&amp;lt;endl;
    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imagegridblock&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/zBKVd/dJMcacBrgq1/hMOJkuh1X7A5XEVGVO7bVk/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/zBKVd/dJMcacBrgq1/hMOJkuh1X7A5XEVGVO7bVk/img.jpg&quot; data-is-animation=&quot;false&quot; data-origin-width=&quot;640&quot; data-origin-height=&quot;426&quot; data-filename=&quot;filtered.jpg&quot; style=&quot;width: 32.5581%; margin-right: 10px;&quot; data-widthpercent=&quot;33.33&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/zBKVd/dJMcacBrgq1/hMOJkuh1X7A5XEVGVO7bVk/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FzBKVd%2FdJMcacBrgq1%2FhMOJkuh1X7A5XEVGVO7bVk%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;640&quot; height=&quot;426&quot;/&gt;&lt;/span&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/nmNZ8/dJMcabP38pW/cY7Ix90Vbjwj5w1LUm3tok/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/nmNZ8/dJMcabP38pW/cY7Ix90Vbjwj5w1LUm3tok/img.jpg&quot; data-is-animation=&quot;false&quot; data-origin-width=&quot;640&quot; data-origin-height=&quot;426&quot; data-filename=&quot;identity.jpg&quot; style=&quot;width: 32.5581%; margin-right: 10px;&quot; data-widthpercent=&quot;33.33&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/nmNZ8/dJMcabP38pW/cY7Ix90Vbjwj5w1LUm3tok/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FnmNZ8%2FdJMcabP38pW%2FcY7Ix90Vbjwj5w1LUm3tok%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; 
loading=&quot;lazy&quot; width=&quot;640&quot; height=&quot;426&quot;/&gt;&lt;/span&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/uyLZj/dJMcaesuhS7/F3XMFcpBdeemrvQTuDBQKk/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/uyLZj/dJMcaesuhS7/F3XMFcpBdeemrvQTuDBQKk/img.jpg&quot; data-is-animation=&quot;false&quot; data-origin-width=&quot;640&quot; data-origin-height=&quot;426&quot; data-filename=&quot;blur.jpg&quot; style=&quot;width: 32.5581%;&quot; data-widthpercent=&quot;33.34&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/uyLZj/dJMcaesuhS7/F3XMFcpBdeemrvQTuDBQKk/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FuyLZj%2FdJMcaesuhS7%2FF3XMFcpBdeemrvQTuDBQKk%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;640&quot; height=&quot;426&quot;/&gt;&lt;/span&gt;&lt;/div&gt;
  &lt;figcaption&gt;Original image / identity kernel / 2D-convolution kernel&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Gaussian Blur&amp;nbsp; &amp;amp;&amp;nbsp; &amp;nbsp;Median blur&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762765079147&quot; class=&quot;cpp&quot; data-ke-language=&quot;cpp&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image = imread(&quot;../filtered.jpg&quot;);

    Mat gaussian_blur;
    GaussianBlur(image, gaussian_blur, Size(5,5), 0, 0);
    imwrite(&quot;gaussian.jpg&quot;, gaussian_blur);

    Mat median_blurred;
    medianBlur(image, median_blurred, 5);  // ksize is a single odd int, not a Size
    imwrite(&quot;median.jpg&quot;, median_blurred);

    return 0;
}&lt;/code&gt;&lt;/pre&gt;
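Why prefer median blur for impulse noise? A 1-D numpy sketch (the `median_filter1d` helper is hypothetical) shows the median discarding a salt spike that a mean filter only smears:

```python
import numpy as np

def median_filter1d(x, k=5):
    """Sliding-window median with edge padding, the 1-D analogue of medianBlur."""
    half = k // 2
    padded = np.pad(x, half, mode="edge")
    return np.array([np.median(padded[i:i + k]) for i in range(len(x))])

signal = np.full(20, 10.0)
signal[7] = 255.0                              # a salt-noise spike
med = median_filter1d(signal, k=5)
mean = np.convolve(signal, np.ones(5) / 5, mode="same")
print(med[7], mean[7])                         # median removes the spike; mean smears it
```

The median simply drops the outlier from each window, while the averaging filter spreads it over every window that contains it, which is why medianBlur is the usual choice for salt-and-pepper noise.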
&lt;p&gt;&lt;figure class=&quot;imagegridblock&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/1NnrL/dJMcacIcVpz/4TxK1ZZY675tZ1rLwL3GOk/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/1NnrL/dJMcacIcVpz/4TxK1ZZY675tZ1rLwL3GOk/img.jpg&quot; data-is-animation=&quot;false&quot; data-origin-width=&quot;640&quot; data-origin-height=&quot;426&quot; data-filename=&quot;filtered.jpg&quot; style=&quot;width: 32.5581%; margin-right: 10px;&quot; data-widthpercent=&quot;33.33&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/1NnrL/dJMcacIcVpz/4TxK1ZZY675tZ1rLwL3GOk/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F1NnrL%2FdJMcacIcVpz%2F4TxK1ZZY675tZ1rLwL3GOk%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;640&quot; height=&quot;426&quot;/&gt;&lt;/span&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/doaYqc/dJMcafLHle8/9UIcY9VeHfZKwWX8XlodB1/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/doaYqc/dJMcafLHle8/9UIcY9VeHfZKwWX8XlodB1/img.jpg&quot; data-is-animation=&quot;false&quot; data-origin-width=&quot;640&quot; data-origin-height=&quot;426&quot; data-filename=&quot;gaussian.jpg&quot; style=&quot;width: 32.5581%; margin-right: 10px;&quot; data-widthpercent=&quot;33.33&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/doaYqc/dJMcafLHle8/9UIcY9VeHfZKwWX8XlodB1/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdoaYqc%2FdJMcafLHle8%2F9UIcY9VeHfZKwWX8XlodB1%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; 
loading=&quot;lazy&quot; width=&quot;640&quot; height=&quot;426&quot;/&gt;&lt;/span&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bxANXI/dJMcaawQ4WK/sJICAhjNpomZBmSbzDhZjK/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bxANXI/dJMcaawQ4WK/sJICAhjNpomZBmSbzDhZjK/img.jpg&quot; data-is-animation=&quot;false&quot; data-origin-width=&quot;640&quot; data-origin-height=&quot;426&quot; data-filename=&quot;median.jpg&quot; style=&quot;width: 32.5581%;&quot; data-widthpercent=&quot;33.34&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bxANXI/dJMcaawQ4WK/sJICAhjNpomZBmSbzDhZjK/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbxANXI%2FdJMcaawQ4WK%2FsJICAhjNpomZBmSbzDhZjK%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;640&quot; height=&quot;426&quot;/&gt;&lt;/span&gt;&lt;/div&gt;
  &lt;figcaption&gt;Original image / Gaussian blur / median blur&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3. 2D-Convolution Kernel Sharpening&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762765890584&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image =imread(&quot;../filtered.jpg&quot;);

    Mat sharp_img;
    Mat kernel3 = (Mat_&amp;lt;double&amp;gt;(3,3) &amp;lt;&amp;lt; 0, -1, 0,
                                        -1, 5, -1,
                                        0, -1, 0);
    filter2D(image, sharp_img, -1, kernel3, Point(-1, -1), 0, BORDER_DEFAULT);
    imwrite(&quot;sharp.jpg&quot;, sharp_img);

    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imagegridblock&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cGOxsM/dJMcad8bzMv/T5ToIG79lUdIpg1JdzKIck/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cGOxsM/dJMcad8bzMv/T5ToIG79lUdIpg1JdzKIck/img.jpg&quot; data-is-animation=&quot;false&quot; data-origin-width=&quot;640&quot; data-origin-height=&quot;426&quot; data-filename=&quot;filtered.jpg&quot; style=&quot;width: 49.4186%; margin-right: 10px;&quot; data-widthpercent=&quot;50&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cGOxsM/dJMcad8bzMv/T5ToIG79lUdIpg1JdzKIck/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcGOxsM%2FdJMcad8bzMv%2FT5ToIG79lUdIpg1JdzKIck%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;640&quot; height=&quot;426&quot;/&gt;&lt;/span&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/mo14G/dJMcadNSJWM/WE0JyiOPvV2Q3V59LJVBiK/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/mo14G/dJMcadNSJWM/WE0JyiOPvV2Q3V59LJVBiK/img.jpg&quot; data-is-animation=&quot;false&quot; data-origin-width=&quot;640&quot; data-origin-height=&quot;426&quot; data-filename=&quot;sharp.jpg&quot; data-widthpercent=&quot;50&quot; style=&quot;width: 49.4186%;&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/mo14G/dJMcadNSJWM/WE0JyiOPvV2Q3V59LJVBiK/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fmo14G%2FdJMcadNSJWM%2FWE0JyiOPvV2Q3V59LJVBiK%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;640&quot; 
height=&quot;426&quot;/&gt;&lt;/span&gt;&lt;/div&gt;
  &lt;figcaption&gt;Original image / sharpened image&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;4. Bilateral Filtering&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762766099745&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image =imread(&quot;../filtered.jpg&quot;);

    Mat bilateral_filter;
    bilateralFilter(image, bilateral_filter, 9, 75, 75);
    imwrite(&quot;bilateral.jpg&quot;, bilateral_filter);

    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imagegridblock&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bfc0bi/dJMcahQhyjk/4TpwE8f7gfZlZPF9OLlesK/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bfc0bi/dJMcahQhyjk/4TpwE8f7gfZlZPF9OLlesK/img.jpg&quot; data-is-animation=&quot;false&quot; data-origin-width=&quot;640&quot; data-origin-height=&quot;426&quot; data-filename=&quot;filtered.jpg&quot; style=&quot;width: 49.4186%; margin-right: 10px;&quot; data-widthpercent=&quot;50&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bfc0bi/dJMcahQhyjk/4TpwE8f7gfZlZPF9OLlesK/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbfc0bi%2FdJMcahQhyjk%2F4TpwE8f7gfZlZPF9OLlesK%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;640&quot; height=&quot;426&quot;/&gt;&lt;/span&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/NEiUS/dJMcahQhyjs/wsSv7ZBoS4x7KZghzrSkB1/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/NEiUS/dJMcahQhyjs/wsSv7ZBoS4x7KZghzrSkB1/img.jpg&quot; data-is-animation=&quot;false&quot; data-origin-width=&quot;640&quot; data-origin-height=&quot;426&quot; data-filename=&quot;bilateral.jpg&quot; style=&quot;width: 49.4186%;&quot; data-widthpercent=&quot;50&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/NEiUS/dJMcahQhyjs/wsSv7ZBoS4x7KZghzrSkB1/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FNEiUS%2FdJMcahQhyjs%2FwsSv7ZBoS4x7KZghzrSkB1%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; 
width=&quot;640&quot; height=&quot;426&quot;/&gt;&lt;/span&gt;&lt;/div&gt;
  &lt;figcaption&gt;Original image / bilateral filtering&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[References]&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;a href=&quot;https://learnopencv.com/image-filtering-using-convolution-in-opencv/&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://learnopencv.com/image-filtering-using-convolution-in-opencv/&lt;/a&gt;&lt;/p&gt;</description>
      <category>OpenCV</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/193</guid>
      <comments>https://ga02-ailab.tistory.com/193#entry193comment</comments>
      <pubDate>Fri, 5 Dec 2025 15:00:26 +0900</pubDate>
    </item>
    <item>
      <title>[OpenCV] C++ OpenCV  image annotating</title>
      <link>https://ga02-ailab.tistory.com/192</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;How to draw lines, circles, and other shapes on an image using OpenCV in C++.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. Drawing a line&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762737136659&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image =imread(&quot;../test.jpg&quot;);

    Mat imageLine = image.clone();
    Point pointA(200,80);
    Point pointB(450, 80);

    line(imageLine, pointA, pointB, Scalar(255, 255, 0), 3, 8, 0);
    imwrite(&quot;line_image.jpg&quot;, imageLine);

    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p id=&quot;lineinput-output-array-pt1-pt2-scalarbgr-thickness-linetype-shift&quot; style=&quot;background-color: #ffffff; color: #343541; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;nbsp;=&amp;gt; line(InputOutputArray, pt1, pt2, Scalar(B,G,R), thickness, lineType, shift)&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;background-color: #ffffff; color: #343541; text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;lineType determines how the line is rendered (8-connected by default, or anti-aliased with LINE_AA), and shift can simply be set to 0.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;line_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/biabbw/dJMcad1pQRS/pM6QNeauhYCykvTp4Kq850/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/biabbw/dJMcad1pQRS/pM6QNeauhYCykvTp4Kq850/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/biabbw/dJMcad1pQRS/pM6QNeauhYCykvTp4Kq850/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbiabbw%2FdJMcad1pQRS%2FpM6QNeauhYCykvTp4Kq850%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;483&quot; height=&quot;322&quot; data-filename=&quot;line_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
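&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;For example, passing LINE_AA as the lineType draws an anti-aliased line. This is a minimal sketch reusing the variables from the code above; the output filename is illustrative:&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;//LINE_8 (the default) draws an 8-connected line; LINE_AA draws an anti-aliased one
line(imageLine, pointA, pointB, Scalar(255, 255, 0), 3, LINE_AA, 0);
imwrite(&quot;line_aa_image.jpg&quot;, imageLine);&lt;/code&gt;&lt;/pre&gt;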
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Drawing a circle&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762738839990&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image =imread(&quot;../test.jpg&quot;);

    Mat circle_image = image.clone();
    Point circle_center(415,190);

    int radius = 100;

    circle(circle_image, circle_center, radius, Scalar(0,255, 0), 3, 8, 0);
    
    //draw a filled circle (a negative thickness fills the shape)
    circle(circle_image, circle_center, radius, Scalar(0, 255, 0), -1, 8, 0);

    imwrite(&quot;circle_image.jpg&quot;, circle_image);

    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imagegridblock&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/KMP0Z/dJMcacamK5J/0EY5YV1Wrlg7YjpB2it3RK/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/KMP0Z/dJMcacamK5J/0EY5YV1Wrlg7YjpB2it3RK/img.jpg&quot; data-is-animation=&quot;false&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot; data-filename=&quot;cricle_image.jpg&quot; width=&quot;500&quot; height=&quot;333&quot; style=&quot;width: 49.4186%; margin-right: 10px;&quot; data-widthpercent=&quot;50&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/KMP0Z/dJMcacamK5J/0EY5YV1Wrlg7YjpB2it3RK/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FKMP0Z%2FdJMcacamK5J%2F0EY5YV1Wrlg7YjpB2it3RK%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;682&quot;/&gt;&lt;/span&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/1GhHq/dJMcacuFBLe/I2TKONKZ3UUi4nPfQNTXc1/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/1GhHq/dJMcacuFBLe/I2TKONKZ3UUi4nPfQNTXc1/img.jpg&quot; data-is-animation=&quot;false&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot; data-filename=&quot;cricle_image.jpg&quot; style=&quot;width: 49.4186%;&quot; data-widthpercent=&quot;50&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/1GhHq/dJMcacuFBLe/I2TKONKZ3UUi4nPfQNTXc1/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F1GhHq%2FdJMcacuFBLe%2FI2TKONKZ3UUi4nPfQNTXc1%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; 
this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1024&quot; height=&quot;682&quot;/&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3. Drawing a rectangle&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762739147898&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image =imread(&quot;../test.jpg&quot;);

    Mat rect_image = image.clone();
    Point start_point(300,115);
    Point end_point(475, 225);

    rectangle(rect_image, start_point, end_point, Scalar(0,0,255), 3, 8, 0);

    imwrite(&quot;rect_image.jpg&quot;, rect_image);

    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;rect_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dSLjwk/dJMcaiaAb2w/zhvG6ZdLroyGnlJbkV4yE0/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dSLjwk/dJMcaiaAb2w/zhvG6ZdLroyGnlJbkV4yE0/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dSLjwk/dJMcaiaAb2w/zhvG6ZdLroyGnlJbkV4yE0/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdSLjwk%2FdJMcaiaAb2w%2FzhvG6ZdLroyGnlJbkV4yE0%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;447&quot; height=&quot;298&quot; data-filename=&quot;rect_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;4.&amp;nbsp; Drawing an ellipse&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762739883636&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image =imread(&quot;../test.jpg&quot;);

    Mat imageEllipse = image.clone();
    Point ellipse_center(415,190);
    Point axis1(100, 50);
    Point axis2(125, 50);

    ellipse(imageEllipse, ellipse_center, axis1, 0, 0, 360, Scalar(255, 0, 0), 3, 8, 0);
    ellipse(imageEllipse, ellipse_center, axis2, 90, 0, 360, Scalar(0, 0, 255), 3, 8, 0);
    
    imwrite(&quot;ellipse_image.jpg&quot;, imageEllipse);

    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;ellipse_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/I1j2A/dJMcai2IV8Q/kF42tuiV0IT14kwiIHA7v1/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/I1j2A/dJMcai2IV8Q/kF42tuiV0IT14kwiIHA7v1/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/I1j2A/dJMcai2IV8Q/kF42tuiV0IT14kwiIHA7v1/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FI1j2A%2FdJMcai2IV8Q%2FkF42tuiV0IT14kwiIHA7v1%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;410&quot; height=&quot;273&quot; data-filename=&quot;ellipse_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;4-1. Drawing a half-filled ellipse&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762740489641&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image =imread(&quot;../test.jpg&quot;);

    Mat halfEllipse = image.clone();
    Point ellipse_center(415,190);
    Point axis1(100, 50);

    ellipse(halfEllipse, ellipse_center, axis1, 0, 180, 360, Scalar(255, 0, 0), 3, 8, 0);
    ellipse(halfEllipse, ellipse_center, axis1, 0, 0, 180, Scalar(0, 0, 255), -1, 8, 0); //a negative thickness draws a filled sector
    
    imwrite(&quot;ellipse_image.jpg&quot;, halfEllipse);

    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;ellipse_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bbwUyr/dJMcai9uxD2/vlUxJxzKbdTASKlKHckgL1/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bbwUyr/dJMcai9uxD2/vlUxJxzKbdTASKlKHckgL1/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bbwUyr/dJMcai9uxD2/vlUxJxzKbdTASKlKHckgL1/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbbwUyr%2FdJMcai9uxD2%2FvlUxJxzKbdTASKlKHckgL1%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;506&quot; height=&quot;337&quot; data-filename=&quot;ellipse_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;5. Inserting text&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762740852512&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image =imread(&quot;../test.jpg&quot;);

    Mat imageText = image.clone();
    putText(imageText, &quot;A beautiful flower!~&quot;, Point(50, 350), FONT_HERSHEY_COMPLEX, 1.5, Scalar(250, 250, 100));
    
    imwrite(&quot;text_image.jpg&quot;, imageText);

    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;text_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/tiLMy/dJMcaj1DdcK/5ApqSwyZlSd72TDiltlP8K/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/tiLMy/dJMcaj1DdcK/5ApqSwyZlSd72TDiltlP8K/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/tiLMy/dJMcaj1DdcK/5ApqSwyZlSd72TDiltlP8K/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FtiLMy%2FdJMcaj1DdcK%2F5ApqSwyZlSd72TDiltlP8K%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;440&quot; height=&quot;293&quot; data-filename=&quot;text_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[References]&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;a href=&quot;https://learnopencv.com/annotating-images-using-opencv/&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://learnopencv.com/annotating-images-using-opencv/&lt;/a&gt;&lt;/p&gt;</description>
      <category>OpenCV</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/192</guid>
      <comments>https://ga02-ailab.tistory.com/192#entry192comment</comments>
      <pubDate>Fri, 21 Nov 2025 15:00:47 +0900</pubDate>
    </item>
    <item>
      <title>[OpenCV] C++ OpenCV 이미지 다루기 기초</title>
      <link>https://ga02-ailab.tistory.com/191</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The basics of handling images in C++: reading and writing, plus transformations such as resizing, cropping, and rotation.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;test.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/btrYhQ/dJMcain6lsy/F8RjHx9s4crbtwEVgrSG40/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/btrYhQ/dJMcain6lsy/F8RjHx9s4crbtwEVgrSG40/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/btrYhQ/dJMcain6lsy/F8RjHx9s4crbtwEVgrSG40/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbtrYhQ%2FdJMcain6lsy%2FF8RjHx9s4crbtwEVgrSG40%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;486&quot; height=&quot;324&quot; data-filename=&quot;test.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;We'll use the image above.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. Reading and writing images&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762416345450&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat img_grayscale = imread(&quot;test.jpg&quot;, 0);
    //or, equivalently:
    //Mat img_grayscale = imread(&quot;test.jpg&quot;, IMREAD_GRAYSCALE);
    imwrite(&quot;grayscale.jpg&quot;, img_grayscale); //save as grayscale
    
    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;grayscale.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/blOLGR/dJMb995LwoC/Q4fB0mFw3pwi5HexTv9zxk/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/blOLGR/dJMb995LwoC/Q4fB0mFw3pwi5HexTv9zxk/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/blOLGR/dJMb995LwoC/Q4fB0mFw3pwi5HexTv9zxk/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FblOLGR%2FdJMb995LwoC%2FQ4fB0mFw3pwi5HexTv9zxk%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;434&quot; height=&quot;289&quot; data-filename=&quot;grayscale.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;To read an image without losing any channels, do the following.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762416493183&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;Mat img_unchanged = imread(&quot;test.jpg&quot;, IMREAD_UNCHANGED);

//or, equivalently:
//Mat img_unchanged = imread(&quot;test.jpg&quot;, -1);&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;There are two ways to check an image's size, as shown below.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762417518802&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;// Method 1
Size s = image.size();
cout &amp;lt;&amp;lt; &quot;Original size: &quot; &amp;lt;&amp;lt; s &amp;lt;&amp;lt; endl;
cout &amp;lt;&amp;lt; &quot;Width : &quot; &amp;lt;&amp;lt; image.size().width &amp;lt;&amp;lt; endl;
cout &amp;lt;&amp;lt; &quot;Height: &quot; &amp;lt;&amp;lt; image.size().height &amp;lt;&amp;lt; endl;
cout &amp;lt;&amp;lt; &quot;Channels: &quot; &amp;lt;&amp;lt; image.channels() &amp;lt;&amp;lt; endl;


// Method 2
cout &amp;lt;&amp;lt; &quot;Original Height and Width: &quot; &amp;lt;&amp;lt; image.rows &amp;lt;&amp;lt; &quot; x &quot; &amp;lt;&amp;lt; image.cols &amp;lt;&amp;lt; endl;&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Resizing images&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;First, a method that specifies the target size directly.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762417230959&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image = imread(&quot;../test.jpg&quot;, 0);
    Size s = image.size();
    cout &amp;lt;&amp;lt; &quot;Original size: &quot; &amp;lt;&amp;lt; s &amp;lt;&amp;lt; endl;

    int down_width = 300;
    int down_height = 200;

    Mat resized_down;

    //resize to (300, 200)
    resize(image, resized_down, Size(down_width, down_height), 0, 0, INTER_LINEAR);
    Size ds = resized_down.size();
    cout &amp;lt;&amp;lt; &quot;Resized size: &quot; &amp;lt;&amp;lt; ds &amp;lt;&amp;lt; endl;

    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;634&quot; data-origin-height=&quot;68&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bONlqo/dJMcaf5YYh0/CVRrbRUBG4gLfYSZPuF8e0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bONlqo/dJMcaf5YYh0/CVRrbRUBG4gLfYSZPuF8e0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bONlqo/dJMcaf5YYh0/CVRrbRUBG4gLfYSZPuF8e0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbONlqo%2FdJMcaf5YYh0%2FCVRrbRUBG4gLfYSZPuF8e0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;476&quot; height=&quot;51&quot; data-origin-width=&quot;634&quot; data-origin-height=&quot;68&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Resizing by passing scale factors is also possible.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762417698764&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image = imread(&quot;../test.jpg&quot;, 0); // 0 = IMREAD_GRAYSCALE
    Size s = image.size();
    cout &amp;lt;&amp;lt; &quot;Original size: &quot; &amp;lt;&amp;lt; s &amp;lt;&amp;lt; endl;

    double scale_up_x = 1.2;
    double scale_up_y = 1.2;
    Mat scaled_f_up;

    resize(image, scaled_f_up, Size(), scale_up_x, scale_up_y, INTER_LINEAR);
    Size ds = scaled_f_up.size();
    cout &amp;lt;&amp;lt; &quot;Resized size: &quot; &amp;lt;&amp;lt; ds &amp;lt;&amp;lt; endl;

    cout &amp;lt;&amp;lt; &quot;finish!!&quot; &amp;lt;&amp;lt; endl;
    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;634&quot; data-origin-height=&quot;68&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cvDBpX/dJMcaeeVZTw/4hOSTVkXV3Nmu51Z3rdqE0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cvDBpX/dJMcaeeVZTw/4hOSTVkXV3Nmu51Z3rdqE0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cvDBpX/dJMcaeeVZTw/4hOSTVkXV3Nmu51Z3rdqE0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcvDBpX%2FdJMcaeeVZTw%2F4hOSTVkXV3Nmu51Z3rdqE0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;466&quot; height=&quot;50&quot; data-origin-width=&quot;634&quot; data-origin-height=&quot;68&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;** There are four commonly used interpolation options (shown below).&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1884&quot; data-origin-height=&quot;516&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/OTVvE/dJMcacVIRcO/uCOApffaO958S5OyKh1sS0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/OTVvE/dJMcacVIRcO/uCOApffaO958S5OyKh1sS0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/OTVvE/dJMcacVIRcO/uCOApffaO958S5OyKh1sS0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FOTVvE%2FdJMcacVIRcO%2FuCOApffaO958S5OyKh1sS0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1884&quot; height=&quot;516&quot; data-origin-width=&quot;1884&quot; data-origin-height=&quot;516&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3. Image crop&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Crop by selecting a row range and a column range with Range.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762418400733&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image = imread(&quot;test.jpg&quot;);
    cout &amp;lt;&amp;lt; &quot;Width: &quot; &amp;lt;&amp;lt; image.size().width &amp;lt;&amp;lt; endl;
    cout &amp;lt;&amp;lt; &quot;Height: &quot; &amp;lt;&amp;lt; image.size().height &amp;lt;&amp;lt; endl;
    cout &amp;lt;&amp;lt; &quot;Channels: &quot; &amp;lt;&amp;lt; image.channels() &amp;lt;&amp;lt; endl;

    Mat cropped_image = image(Range(80,480), Range(150, 430)); // (row range, column range)

    imwrite(&quot;crop_image.jpg&quot;, cropped_image);

    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;crop_image.jpg&quot; data-origin-width=&quot;280&quot; data-origin-height=&quot;400&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Rsozg/dJMcabP2Yc4/R76sjgDEmyBhRIyKPsZNb1/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Rsozg/dJMcabP2Yc4/R76sjgDEmyBhRIyKPsZNb1/img.jpg&quot; data-alt=&quot;crop_image.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Rsozg/dJMcabP2Yc4/R76sjgDEmyBhRIyKPsZNb1/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FRsozg%2FdJMcabP2Yc4%2FR76sjgDEmyBhRIyKPsZNb1%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;200&quot; height=&quot;286&quot; data-filename=&quot;crop_image.jpg&quot; data-origin-width=&quot;280&quot; data-origin-height=&quot;400&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;crop_image.jpg&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;You can also use nested for loops to split an image into multiple patches.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762419256423&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image = imread(&quot;../test.jpg&quot;);
    Mat image_copy = image.clone();
    int imgheight = image.rows;
    int imgwidth = image.cols;

    int M = 76;   // patch height
    int N = 104;  // patch width

    for (int y = 0; y &amp;lt; imgheight; y += M)
    {
        for (int x = 0; x &amp;lt; imgwidth; x += N)
        {
            // Clamp each patch to the image border so edge patches are
            // simply smaller, instead of being skipped or collapsing to
            // a one-pixel strip.
            int y1 = min(y + M, imgheight);
            int x1 = min(x + N, imgwidth);

            Mat tile = image_copy(Range(y, y1), Range(x, x1));
            imwrite(&quot;patched/tile&quot; + to_string(x) + '_' + to_string(y) + &quot;.jpg&quot;, tile);
            rectangle(image, Point(x, y), Point(x1 - 1, y1 - 1), Scalar(0, 255, 0), 1);
        }
    }
    
    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2670&quot; data-origin-height=&quot;1308&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/F6iGY/dJMcad1oMsi/Rx0ac6C6Tzxj94vK1e3nxK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/F6iGY/dJMcad1oMsi/Rx0ac6C6Tzxj94vK1e3nxK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/F6iGY/dJMcad1oMsi/Rx0ac6C6Tzxj94vK1e3nxK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FF6iGY%2FdJMcad1oMsi%2FRx0ac6C6Tzxj94vK1e3nxK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2670&quot; height=&quot;1308&quot; data-origin-width=&quot;2670&quot; data-origin-height=&quot;1308&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;4. Image rotation and translation&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;First, rotation.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762419584809&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image = imread(&quot;../test.jpg&quot;);

    Point2f center((image.cols-1)/2.0, (image.rows-1)/2.0);

    Mat rotation_matrix = getRotationMatrix2D(center, 45, 1.0);
    Mat rotated_image;
    warpAffine(image, rotated_image, rotation_matrix, image.size());

    imwrite(&quot;rotate_image.jpg&quot;, rotated_image);
    
    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;rotate_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/GHJcV/dJMcaap4gSL/A3PGQEbcubhh1ahH9fKGPK/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/GHJcV/dJMcaap4gSL/A3PGQEbcubhh1ahH9fKGPK/img.jpg&quot; data-alt=&quot;rotate_image.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/GHJcV/dJMcaap4gSL/A3PGQEbcubhh1ahH9fKGPK/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FGHJcV%2FdJMcaap4gSL%2FA3PGQEbcubhh1ahH9fKGPK%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;447&quot; height=&quot;298&quot; data-filename=&quot;rotate_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;rotate_image.jpg&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Next, image translation.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762419926566&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    Mat image = imread(&quot;../test.jpg&quot;);

    int height = image.rows;
    int width = image.cols;

    float tx = float(width) / 4;
    float ty = float(height) / 4;
    
    float warp_values[] = {1.0, 0.0, tx, 0.0, 1.0, ty};
    Mat translation_matrix = Mat(2, 3, CV_32F, warp_values);

    Mat translated_image;
    warpAffine(image, translated_image, translation_matrix, image.size());

    imwrite(&quot;translated_image.jpg&quot;, translated_image);

    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;translated_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/birCbM/dJMcagw3kXo/NnlnhKuZPbhLvvYoofLFMK/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/birCbM/dJMcagw3kXo/NnlnhKuZPbhLvvYoofLFMK/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/birCbM/dJMcagw3kXo/NnlnhKuZPbhLvvYoofLFMK/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbirCbM%2FdJMcagw3kXo%2FNnlnhKuZPbhLvvYoofLFMK%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;477&quot; height=&quot;318&quot; data-filename=&quot;translated_image.jpg&quot; data-origin-width=&quot;1024&quot; data-origin-height=&quot;682&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #ee2323;&quot;&gt;** Misc.&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762418517849&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;// copy an image (clone() makes a deep copy)
Mat image_copy = image.clone();&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;** References&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #0593d3;&quot;&gt;&lt;a style=&quot;color: #0593d3;&quot; href=&quot;https://learnopencv.com/read-display-and-write-an-image-using-opencv/&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://learnopencv.com/read-display-and-write-an-image-using-opencv/&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #0593d3;&quot;&gt;&lt;a style=&quot;color: #0593d3;&quot; href=&quot;https://learnopencv.com/image-resizing-with-opencv/&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://learnopencv.com/image-resizing-with-opencv/&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #0593d3;&quot;&gt;&lt;a style=&quot;color: #0593d3;&quot; href=&quot;https://learnopencv.com/cropping-an-image-using-opencv/&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://learnopencv.com/cropping-an-image-using-opencv/&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #0593d3;&quot;&gt;&lt;a style=&quot;color: #0593d3;&quot; href=&quot;https://learnopencv.com/image-rotation-and-translation-using-opencv/&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://learnopencv.com/image-rotation-and-translation-using-opencv/&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>OpenCV</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/191</guid>
      <comments>https://ga02-ailab.tistory.com/191#entry191comment</comments>
      <pubDate>Wed, 12 Nov 2025 15:00:19 +0900</pubDate>
    </item>
    <item>
      <title>[OpenCV] C++ OpenCV 설치 (with Linux)</title>
      <link>https://ga02-ailab.tistory.com/190</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1.&amp;nbsp; First, install OpenCV.&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762413415002&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;sudo apt update
sudo apt install libopencv-dev&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Since we will build with CMake, write a CMakeLists.txt file.&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762413484860&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;cmake_minimum_required(VERSION 3.10)
project(OpenCVExample)

find_package(OpenCV REQUIRED)
include_directories(${OpenCV_INCLUDE_DIRS})

add_executable(main main.cpp)
target_link_libraries(main ${OpenCV_LIBS})&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A main.cpp file must already exist at this point. Here it simply prints the OpenCV version:&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762413734194&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;// main.cpp

#include &amp;lt;opencv2/opencv.hpp&amp;gt;
#include &amp;lt;iostream&amp;gt;
using namespace std;
using namespace cv;

int main() {
    cout &amp;lt;&amp;lt; &quot;OpenCV version : &quot; &amp;lt;&amp;lt; CV_VERSION &amp;lt;&amp;lt; std::endl;
    return 0;
}&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3. Then build and run it.&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1762413761371&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;mkdir build &amp;amp;&amp;amp; cd build
cmake ..
make
./main&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;If the build succeeds, the OpenCV version is printed.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;634&quot; data-origin-height=&quot;44&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/yLamb/dJMcajN42sJ/zNUksyjoO0daxoOl4VQ5A1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/yLamb/dJMcajN42sJ/zNUksyjoO0daxoOl4VQ5A1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/yLamb/dJMcajN42sJ/zNUksyjoO0daxoOl4VQ5A1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FyLamb%2FdJMcajN42sJ%2FzNUksyjoO0daxoOl4VQ5A1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;461&quot; height=&quot;32&quot; data-origin-width=&quot;634&quot; data-origin-height=&quot;44&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>OpenCV</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/190</guid>
      <comments>https://ga02-ailab.tistory.com/190#entry190comment</comments>
      <pubDate>Thu, 6 Nov 2025 16:23:30 +0900</pubDate>
    </item>
    <item>
      <title>[2] FLAVA: A Foundational Language And Vision Alignment Model</title>
      <link>https://ga02-ailab.tistory.com/189</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;[Paper]&lt;/span&gt; &lt;a href=&quot;https://arxiv.org/pdf/2112.04482&quot;&gt;https://arxiv.org/pdf/2112.04482&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;[Github]&lt;/span&gt; &lt;a href=&quot;https://github.com/facebookresearch/multimodal/tree/main/examples/flava&quot;&gt;https://github.com/facebookresearch/multimodal/tree/main/examples/flava&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1760928518293&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;multimodal/examples/flava at main &amp;middot; facebookresearch/multimodal&quot; data-og-description=&quot;TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale. - facebookresearch/multimodal&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/facebookresearch/multimodal/tree/main/examples/flava&quot; data-og-url=&quot;https://github.com/facebookresearch/multimodal/tree/main/examples/flava&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/bVtz4X/hyZL6ZI921/oM40VegFKQC99WcFT8ib00/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600,https://scrap.kakaocdn.net/dn/wRRJu/hyZL1KSBX2/n9vxOWdke4sgRIoTWKzS9K/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600&quot;&gt;&lt;a href=&quot;https://github.com/facebookresearch/multimodal/tree/main/examples/flava&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/facebookresearch/multimodal/tree/main/examples/flava&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/bVtz4X/hyZL6ZI921/oM40VegFKQC99WcFT8ib00/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600,https://scrap.kakaocdn.net/dn/wRRJu/hyZL1KSBX2/n9vxOWdke4sgRIoTWKzS9K/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;multimodal/examples/flava at main &amp;middot; facebookresearch/multimodal&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale. - facebookresearch/multimodal&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. Abstract&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Limitations of prior work&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Prior vision and VLM models improve performance through large-scale vision-language pretraining&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Many models focus on a specific modality or task&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;They adopt only one of the cross-modal (CLIP-style) or multi-modal (Transformer fusion) approaches&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;rArr; FLAVA moves past these limits as a single, holistic &amp;ldquo;foundation&amp;rdquo; model: a general-purpose vision-language model that performs well on vision, language, cross-modal, and multi-modal tasks alike&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Introduction&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;486&quot; data-origin-height=&quot;334&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/c83md1/dJMb89SkiN0/ad7f2XPqsDzK6H7UbhCwM0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/c83md1/dJMb89SkiN0/ad7f2XPqsDzK6H7UbhCwM0/img.png&quot; data-alt=&quot;masked image modeling (MIM) / mask language modeling (MLM) / masked multimodal modeling (MMM)&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/c83md1/dJMb89SkiN0/ad7f2XPqsDzK6H7UbhCwM0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fc83md1%2FdJMb89SkiN0%2Fad7f2XPqsDzK6H7UbhCwM0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;442&quot; height=&quot;304&quot; data-origin-width=&quot;486&quot; data-origin-height=&quot;334&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;masked image modeling (MIM) / mask language modeling (MLM) / masked multimodal modeling (MMM)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Weaknesses of cross-modal and multi-modal approaches&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;cross-modal: predicts/retrieves one modality from the other &amp;rArr; weak on fusion tasks&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;multi-modal: processes multiple modalities jointly &amp;rArr; single-modality performance can degrade&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;FLAVA: Foundational Language And Vision Alignment Model&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Trains on a combination of multimodal (image-text pairs) and unimodal (image-only / text-only) data,&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;yielding a general-purpose model that learns strong visual and language representations across all modalities&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Applies masking-based training to learn strong representations&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;enabling representation learning from diverse forms of data, beyond plain contrastive learning&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Validated on 35 tasks&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;✅ Masking-based training&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;= reconstruction-based&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Mask part of the data and train the model to restore it &amp;rArr; a training method that can greatly improve representation quality&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The model learns correlations across multimodal information while reconstructing the masked parts&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Approached in terms of intermediate representations and partial reconstruction&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3. Method&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;980&quot; data-origin-height=&quot;364&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bcje2d/dJMb9WFqG78/so8ZwdlOZlF7Jx3nOwxKuk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bcje2d/dJMb9WFqG78/so8ZwdlOZlF7Jx3nOwxKuk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bcje2d/dJMb9WFqG78/so8ZwdlOZlF7Jx3nOwxKuk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbcje2d%2FdJMb9WFqG78%2Fso8ZwdlOZlF7Jx3nOwxKuk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;980&quot; height=&quot;364&quot; data-origin-width=&quot;980&quot; data-origin-height=&quot;364&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #006dd7;&quot;&gt;3.1. The model architecture&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Transformer-based&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Composed of three parts&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Unimodal: image encoder, text encoder&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Multimodal: multimodal encoder&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #1b711d;&quot;&gt;&lt;b&gt;image encoder&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;ViT architecture (ViT-B/16)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;resizing &amp;rarr; patching &amp;rarr; positional embedding &amp;rarr; hidden state vector &lt;b&gt;h_I&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A classification head is added (for downstream tasks) &amp;rArr; [CLS_I]&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #1b711d;&quot;&gt;&lt;b&gt;text encoder&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;BERT-based Transformer&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;tokenization &amp;rarr; token embedding &amp;rarr; hidden state vector &lt;b&gt;h_T&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A classification head is added (for downstream tasks) &amp;rArr; [CLS_T]&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #1b711d;&quot;&gt;&lt;b&gt;multimodal encoder&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A linear projection is applied to each of the hidden state vectors (h_I, h_T) obtained from the image encoder and text encoder&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The hidden representations from the unimodal encoders are combined to learn a fused representation and to reconstruct the masked parts&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;These are then merged (concatenated) into a single sequence&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A special token for multimodal classification is added &amp;rArr; [CLS_M]&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
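The three-encoder data flow above can be sketched in plain Python. This is a minimal toy sketch, not FLAVA's real modules: the "encoders" are stand-in functions, and the linear projections are left as identity.

```python
# Toy sketch of FLAVA's data flow: two unimodal encoders feed a
# multimodal encoder that projects and concatenates their outputs.
# The "encoders" here are stand-ins, not the real ViT/BERT modules.

def image_encoder(patches):
    # ViT-like: prepend a [CLS_I] token, return hidden states h_I
    return ["[CLS_I]"] + [f"h_I({p})" for p in patches]

def text_encoder(tokens):
    # BERT-like: prepend a [CLS_T] token, return hidden states h_T
    return ["[CLS_T]"] + [f"h_T({t})" for t in tokens]

def multimodal_encoder(h_i, h_t):
    # Linear-project each sequence (identity here), concatenate them,
    # and prepend the multimodal classification token [CLS_M].
    project = lambda h: h  # stand-in for the learned linear projections
    return ["[CLS_M]"] + project(h_i) + project(h_t)

h_i = image_encoder(["p0", "p1"])
h_t = text_encoder(["a", "cat"])
h_m = multimodal_encoder(h_i, h_t)
print(h_m[0])  # the fused sequence starts with [CLS_M]
```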
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style3&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;nbsp; Within one large network, FLAVA takes&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- image only,&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- text only,&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- image-text pair inputs&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Because a single model handles all three cases, objectives are defined for each of them&lt;/span&gt;&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #006dd7;&quot;&gt;&lt;b&gt;3.2. Multimodal pre-training objectives&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #1b711d;&quot;&gt;&lt;b&gt;Global contrastive (GC) loss&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Similar to CLIP &amp;rArr; contrastive learning on h_I and h_T&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;h_CLS,I and h_CLS,T are each linearly projected into an embedding space, L2-normalized, dot-producted, and a temperature-scaled softmax loss is computed&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Learns the relationship (alignment) of image-text pairs&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #1b711d;&quot;&gt;&lt;b&gt;Masked multimodal modeling (MMM)&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Masking is applied to both the text input and the image patches&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Image: the image is split into patches and randomly chosen patches are masked&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Text: 15% of the text tokens are randomly chosen and replaced with the [MASK] token (BERT-style)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The multimodal encoder's output ({h_M}) is processed by a multi-layer perceptron to reconstruct the masked data&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Image: predict the visual codebook index of each masked image patch&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;What is a codebook?&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A &amp;ldquo;visual vocabulary&amp;rdquo; that maps image patches to discrete vectors&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Text: predict the word vocabulary index of each masked text token&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;In short, information from one modality can be used to predict the masked parts of the other.&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #1b711d;&quot;&gt;&lt;b&gt;Image-text matching (ITM)&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Uses the multimodal encoder's [CLS_M] vector to judge whether the image and text actually match&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
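The CLIP-style global contrastive loss can be sketched without any framework: L2-normalize the projected [CLS] embeddings, take dot products, scale by a temperature, and apply a softmax cross-entropy where the matching pair is the positive. This is a minimal image-to-text-only sketch on a tiny hand-made batch; FLAVA's real loss uses learned projections, both directions, and full batches.

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Image->text InfoNCE over a tiny batch; pair i is the positive for row i."""
    img = [l2_normalize(v) for v in img_embs]
    txt = [l2_normalize(v) for v in txt_embs]
    loss = 0.0
    for i, u in enumerate(img):
        logits = [sum(a * b for a, b in zip(u, v)) / temperature for v in txt]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_z)  # cross-entropy with target index i
    return loss / len(img)

# Perfectly aligned pairs give a much lower loss than mismatched ones.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```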
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;3.3. Unimodal pre-training objectives&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2048&quot; data-origin-height=&quot;512&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/z7DmY/dJMb9LqrBSZ/8X99uakHLuEeeczK3vUWA1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/z7DmY/dJMb9LqrBSZ/8X99uakHLuEeeczK3vUWA1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/z7DmY/dJMb9LqrBSZ/8X99uakHLuEeeczK3vUWA1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fz7DmY%2FdJMb9LqrBSZ%2F8X99uakHLuEeeczK3vUWA1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2048&quot; height=&quot;512&quot; data-origin-width=&quot;2048&quot; data-origin-height=&quot;512&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #1b711d;&quot;&gt;&lt;b&gt;Masked image modeling (MIM)&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Some patches are masked, and the model is trained to reconstruct the masked parts (predicting their visual codebook indices, as in the MMM objective above)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The model learns the structure and patterns of the image itself &amp;rarr; better visual representations&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #1b711d;&quot;&gt;&lt;b&gt;Masked language modeling (MLM)&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Some words or tokens are masked&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Effectively captures linguistic context and meaning &amp;rarr; better language representations&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
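The BERT-style masking step that MLM (and the text side of MMM) relies on can be sketched as follows. A minimal sketch: real BERT masking additionally replaces some chosen positions with random tokens or leaves them unchanged, which is omitted here.

```python
import random

def mask_tokens(tokens, ratio=0.15, seed=0):
    """Replace ~15% of tokens with [MASK]; return masked tokens and target positions."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * ratio))
    picked = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    for i in picked:
        masked[i] = "[MASK]"
    return masked, picked  # the model is trained to predict tokens at `picked`

tokens = "a cat sits on the warm red mat near the door".split()
masked, targets = mask_tokens(tokens)
print(masked.count("[MASK]"), len(targets))
```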
&lt;blockquote data-ke-style=&quot;style3&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;&lt;/b&gt;&lt;/span&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1808&quot; data-origin-height=&quot;822&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bsw2Nv/dJMb9LqrBTf/dQKcSWGOP1plUnFjyheFl1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bsw2Nv/dJMb9LqrBTf/dQKcSWGOP1plUnFjyheFl1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bsw2Nv/dJMb9LqrBTf/dQKcSWGOP1plUnFjyheFl1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbsw2Nv%2FdJMb9LqrBTf%2FdQKcSWGOP1plUnFjyheFl1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1808&quot; height=&quot;822&quot; data-origin-width=&quot;1808&quot; data-origin-height=&quot;822&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #006dd7;&quot;&gt;&lt;b&gt;3.4. Implementation details&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;batch size = 8192&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;lr = 1e-3&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;optimizer = AdamW&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Detailed training procedure&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;Images only (batch of images)&lt;/b&gt; &amp;rarr; Masked Image Modeling (MIM)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;Text only (batch of text)&lt;/b&gt; &amp;rarr; Masked Language Modeling (MLM)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;Image&amp;ndash;text pairs (batch of pairs)&lt;/b&gt; &amp;rarr; Contrastive + Matching + Masked Multimodal (GC + ITM + MMM)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The three data types are sampled in &lt;b&gt;round-robin&lt;/b&gt; fashion&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;One iteration uses image-only data (an ImageNet batch) &amp;rarr; compute the MIM loss&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The next iteration uses text-only data (a BookCorpus batch) &amp;rarr; compute the MLM loss&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The iteration after that uses image-text pair data (COCO, CC12M) &amp;rarr; compute GC + ITM + MMM&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;rArr; By cycling through these three batch types, &lt;b&gt;the parameters of all encoders are updated together&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
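The round-robin schedule above can be sketched with itertools.cycle. A minimal sketch: the "loaders" are placeholder iterators standing in for the real ImageNet/BookCorpus/pair data loaders, and the loss labels stand in for an actual optimizer step.

```python
from itertools import cycle, islice

# Placeholder "loaders": each yields batches of one data type.
image_loader = iter([f"img_batch_{i}" for i in range(10)])
text_loader  = iter([f"txt_batch_{i}" for i in range(10)])
pair_loader  = iter([f"pair_batch_{i}" for i in range(10)])

schedule = cycle([
    (image_loader, "MIM"),
    (text_loader,  "MLM"),
    (pair_loader,  "GC+ITM+MMM"),
])

losses = []
for loader, objective in islice(schedule, 6):  # 6 iterations = 2 full rounds
    batch = next(loader)
    losses.append(objective)  # here: one optimizer step on `objective`

print(losses)  # ['MIM', 'MLM', 'GC+ITM+MMM', 'MIM', 'MLM', 'GC+ITM+MMM']
```

Because every batch type flows through a shared loss-and-backward step, all encoder parameters receive gradient updates as the schedule cycles.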
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #006dd7; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;3.5. Data: Public Multimodal Datasets (PMD)&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imagegridblock&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dgF7KU/dJMb9PM9sB9/0yedkFvdxztpRI43kNRAvk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dgF7KU/dJMb9PM9sB9/0yedkFvdxztpRI43kNRAvk/img.png&quot; data-origin-width=&quot;990&quot; data-origin-height=&quot;211&quot; data-is-animation=&quot;false&quot; style=&quot;width: 69.9899%; margin-right: 10px;&quot; data-widthpercent=&quot;70.81&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dgF7KU/dJMb9PM9sB9/0yedkFvdxztpRI43kNRAvk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdgF7KU%2FdJMb9PM9sB9%2F0yedkFvdxztpRI43kNRAvk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;990&quot; height=&quot;211&quot;/&gt;&lt;/span&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cLtF2O/dJMb9OHsQSu/AWFGPWu8a9UCP0qJGssTd1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cLtF2O/dJMb9OHsQSu/AWFGPWu8a9UCP0qJGssTd1/img.png&quot; data-origin-width=&quot;497&quot; data-origin-height=&quot;257&quot; data-is-animation=&quot;false&quot; style=&quot;width: 28.8473%;&quot; data-widthpercent=&quot;29.19&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cLtF2O/dJMb9OHsQSu/AWFGPWu8a9UCP0qJGssTd1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcLtF2O%2FdJMb9OHsQSu%2FAWFGPWu8a9UCP0qJGssTd1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;497&quot; height=&quot;257&quot;/&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Uses the Public Multimodal Datasets (PMD) collection assembled by the authors&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;About 70 million image-text pairs&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;About 68 million images in total, with an average caption length of 12.1 words&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Built only from public datasets such as Visual Genome and Conceptual Captions, which helps reproducibility and future extension&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #99cefa;&quot;&gt;4. Experiments&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Comparison to state-of-the-art models&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1137&quot; data-origin-height=&quot;340&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/wjyE9/dJMb80A44Y2/uGOtqPqo7KDspu9q0mqlxK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/wjyE9/dJMb80A44Y2/uGOtqPqo7KDspu9q0mqlxK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/wjyE9/dJMb80A44Y2/uGOtqPqo7KDspu9q0mqlxK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FwjyE9%2FdJMb80A44Y2%2FuGOtqPqo7KDspu9q0mqlxK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1137&quot; height=&quot;340&quot; data-origin-width=&quot;1137&quot; data-origin-height=&quot;340&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #006dd7;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;ablation study&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Full FLAVA model&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1006&quot; data-origin-height=&quot;362&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dFKCxX/dJMb9OHsQTF/5B69ryD93XQQIiQdHuiLyk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dFKCxX/dJMb9OHsQTF/5B69ryD93XQQIiQdHuiLyk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dFKCxX/dJMb9OHsQTF/5B69ryD93XQQIiQdHuiLyk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdFKCxX%2FdJMb9OHsQTF%2F5B69ryD93XQQIiQdHuiLyk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1006&quot; height=&quot;362&quot; data-origin-width=&quot;1006&quot; data-origin-height=&quot;362&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1, 2: unimodal-only&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The image encoder and text encoder are trained independently&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3, 4: multimodal-only&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3) Alignment between the two modalities is learned via contrastive learning&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;4) A fused representation is learned via the multimodal encoder&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;5, 6: unimodal + multimodal&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The full FLAVA model is trained&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #99cefa;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;5. Limitations&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Although training efficiency is high, having three encoders means a large &lt;b&gt;memory footprint&lt;/b&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Because the multimodal encoder uses simple concatenation instead of cross-attention, it may fall short at learning fine-grained relationships.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;rarr; Later work (&lt;b&gt;BLIP, ALBEF, Flamingo, Kosmos-2.5&lt;/b&gt;, etc.) improves on these points&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper Review/LLM &amp;amp; VLM</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/189</guid>
      <comments>https://ga02-ailab.tistory.com/189#entry189comment</comments>
      <pubDate>Mon, 20 Oct 2025 12:03:17 +0900</pubDate>
    </item>
    <item>
      <title>NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.</title>
      <link>https://ga02-ailab.tistory.com/188</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #c1bef9; color: #000000;&quot;&gt;&lt;b&gt;[문제 상황]&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;python 멀티 스레드 실행시, model을 .to('cpu') 또는 .to('cuda')등 모델을 cpu나 gpu에 올릴때 발생하는 에러입니다. (멀티스레드로 실행하지 않으면 에러 발생하지 않음)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;[Cause]&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt;It happens because a code path that calls&amp;nbsp;&lt;/span&gt;.to()&lt;span style=&quot;text-align: start;&quot;&gt;&amp;nbsp;on a meta-initialized module before it has been filled with real tensors occurs only in the threaded path. The patterns below are the usual reasons meta tensors remain only on the thread path.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #000000; text-align: start;&quot; data-end=&quot;171&quot; data-start=&quot;148&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #000000; text-align: start;&quot; data-end=&quot;171&quot; data-start=&quot;148&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Why do meta tensors remain only in threads?&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #000000; text-align: start;&quot; data-end=&quot;207&quot; data-start=&quot;173&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li data-end=&quot;207&quot; data-start=&quot;173&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;The model creation/loading order differs between worker threads and the main thread&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul style=&quot;list-style-type: disc; color: #000000; text-align: start;&quot; data-end=&quot;568&quot; data-start=&quot;208&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;297&quot; data-start=&quot;208&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Main thread: initialized directly with real tensors, as in&amp;nbsp;from_pretrained(..., low_cpu_mem_usage=False) &amp;rarr;&amp;nbsp;.to()&amp;nbsp;works&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;568&quot; data-start=&quot;298&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Worker thread: the&amp;nbsp;from_pretrained(..., low_cpu_mem_usage=True)&amp;nbsp;or&amp;nbsp;device_map=&quot;meta&quot;/init_empty_weights()&amp;nbsp;path &amp;rarr;&amp;nbsp;&lt;b&gt;an empty module is created on the meta device&lt;/b&gt;&amp;nbsp;&amp;rarr; (calling&amp;nbsp;.to()&amp;nbsp;right here raises the error)&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-end=&quot;568&quot; data-start=&quot;454&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;568&quot; data-start=&quot;454&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In particular,&amp;nbsp;low_cpu_mem_usage=True internally runs two stages, &amp;ldquo;initialize on meta &amp;rarr; inject the state_dict&amp;rdquo;; if&amp;nbsp;&lt;b&gt;the threaded path runs&amp;nbsp;.to() before the injection&lt;/b&gt;, it fails.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #000000; text-align: start;&quot; data-end=&quot;627&quot; data-start=&quot;570&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li data-end=&quot;627&quot; data-start=&quot;570&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;The model is created in the meta state on the main thread, then handed to a thread that calls&amp;nbsp;.to()&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul style=&quot;list-style-type: disc; color: #000000; text-align: start;&quot; data-end=&quot;682&quot; data-start=&quot;628&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;682&quot; data-start=&quot;628&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The same error occurs if the receiving thread calls&amp;nbsp;.to() before the weights are loaded.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #000000; text-align: start;&quot; data-end=&quot;723&quot; data-start=&quot;684&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li data-end=&quot;723&quot; data-start=&quot;684&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;The Accelerate/HF context differs across threads&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul style=&quot;list-style-type: disc; color: #000000; text-align: start;&quot; data-end=&quot;823&quot; data-start=&quot;724&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;823&quot; data-start=&quot;724&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Context managers/global flags such as&amp;nbsp;init_empty_weights()&amp;nbsp;can be applied differently as &lt;b&gt;thread-local&lt;/b&gt; state, so different threads may branch into different paths (meta path vs. normal path).&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #000000; text-align: start;&quot; data-end=&quot;857&quot; data-start=&quot;825&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li data-end=&quot;857&quot; data-start=&quot;825&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;CUDA device initialization happens late in the thread&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul style=&quot;list-style-type: disc; color: #000000; text-align: start;&quot; data-end=&quot;1034&quot; data-start=&quot;858&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;1034&quot; data-start=&quot;858&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;If you do not call&amp;nbsp;torch.cuda.set_device(...)&amp;nbsp;right after the thread starts and then try to move a meta-state model with&amp;nbsp;.to(&quot;cuda&quot;), the device/context initialization timing can get crossed internally, producing a flow where the device move is called before the weights are injected. (The direct cause is still that meta tensors cannot be copied, but this makes the timing worse.)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
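The failure at the core of all four causes can be reproduced without any threads at all: a tensor on the meta device has no storage, so materializing it fails until real weights are injected. A minimal sketch:

```python
import torch

# A "meta" tensor has shape and dtype but no backing storage, so
# materializing it with .to() raises until real weights are injected.
t = torch.empty(2, 2, device="meta")
try:
    t.to("cpu")
except (NotImplementedError, RuntimeError) as e:
    print("copy failed:", type(e).__name__)
```

The threaded versions above hit this same error whenever `.to()` sneaks in before the state_dict injection step.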
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Because this is a timing problem where&amp;nbsp;.to()&amp;nbsp;runs before the weights are injected, you can stabilize it by &lt;b&gt;locking the critical section&lt;/b&gt;. The key is to guarantee the order &lt;b&gt;meta &amp;rarr; empty tensors &amp;rarr; weight injection&lt;/b&gt; (or a &lt;b&gt;normal load&lt;/b&gt;) across threads, &lt;b&gt;always&lt;/b&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; text-align: start; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;[Solution]&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1757570022864&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import threading

import torch
from diffusers import StableDiffusionXLPipeline

# One lock shared by all threads: serializes the critical
# load-then-move section so injection always precedes .to().
device_lock = threading.Lock()

...

with device_lock:
    base_model_pipe = StableDiffusionXLPipeline.from_pretrained(
        base_model_with_sub_path,
        torch_dtype=torch.float16,
        variant=&quot;fp16&quot;,
        use_safetensors=True,
        device_map=None,
    )
    base_model_pipe.to(&quot;cpu&quot;)

...&lt;/code&gt;&lt;/pre&gt;
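The lock works because it makes "load, then move" atomic per thread. A toy sketch with plain threading (the hypothetical `load_and_move` stands in for the pipeline code) showing the guaranteed ordering:

```python
import threading
import time

device_lock = threading.Lock()
events = []

def load_and_move(name: str) -> None:
    # Critical section: "load" and "move" happen atomically, so no other
    # thread can interleave a .to() before this model's injection finishes.
    with device_lock:
        events.append((name, "load"))
        time.sleep(0.01)  # simulate slow weight injection
        events.append((name, "move"))

threads = [threading.Thread(target=load_and_move, args=(f"m{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each model's "load" is immediately followed by its own "move".
paired = all(events[i][0] == events[i + 1][0] for i in range(0, len(events), 2))
print(paired)  # → True
```

Without the lock, nothing prevents a thread from observing another model mid-load, which is exactly the window where the meta-tensor error appears.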
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Error Note</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/188</guid>
      <comments>https://ga02-ailab.tistory.com/188#entry188comment</comments>
      <pubDate>Thu, 25 Sep 2025 15:00:14 +0900</pubDate>
    </item>
    <item>
      <title>[Pytorch] Python multithreading and torch.cuda.empty_cache()</title>
      <link>https://ga02-ailab.tistory.com/187</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Problem]&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In a multithreaded Python setup, after setting the GPU index each thread should use, temporarily moving the deep learning model to CPU and calling torch.cuda.empty_cache() made GPU 0 suddenly start accumulating memory in addition to the specified GPUs. The cause and fix according to ChatGPT are below.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Cause and fix] - by ChatGPT&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;torch.cuda.empty_cache()&amp;nbsp;clears&amp;nbsp;&lt;b&gt;only the cache of the current thread&amp;rsquo;s &amp;ldquo;current device&amp;rdquo;&lt;/b&gt;, and the moment you call it,&amp;nbsp;&lt;b&gt;it initializes that device&amp;rsquo;s CUDA context if none exists yet&lt;/b&gt;. If the thread never set a&amp;nbsp;current device, the default is&amp;nbsp;&lt;b&gt;cuda:0&lt;/b&gt;, so trying to empty an empty cache instead makes it look like&amp;nbsp;&lt;b&gt;the GPU0 context was created &amp;rarr; a few MB are in use&lt;/b&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #000000; text-align: start;&quot; data-end=&quot;250&quot; data-start=&quot;247&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Summary:&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #000000; text-align: start;&quot; data-end=&quot;406&quot; data-start=&quot;251&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;294&quot; data-start=&quot;251&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;empty_cache()&amp;nbsp;itself is not a function that &amp;ldquo;allocates&amp;rdquo; memory, but&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;357&quot; data-start=&quot;295&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;the&amp;nbsp;&lt;b&gt;context initialization cost&lt;/b&gt;&amp;nbsp;(driver handles, allocator structures, etc.) puts a small amount of memory on GPU0.&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;406&quot; data-start=&quot;358&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;in multithreaded code you must pin the&amp;nbsp;&lt;b&gt;current device separately per thread&lt;/b&gt;.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;More detailed reasoning:&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #000000; text-align: start;&quot; data-end=&quot;117&quot; data-start=&quot;74&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li data-end=&quot;117&quot; data-start=&quot;74&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;PyTorch&amp;rsquo;s CUDA uses &amp;lsquo;lazy initialization (lazy init)&amp;rsquo;&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul style=&quot;list-style-type: disc; color: #000000; text-align: start;&quot; data-end=&quot;218&quot; data-start=&quot;118&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;218&quot; data-start=&quot;118&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;torch.cuda is&amp;nbsp;&lt;b&gt;lazily initialized&lt;/b&gt;;&amp;nbsp;&lt;b&gt;the first CUDA call&lt;/b&gt;&amp;nbsp;initializes the driver/context.&amp;nbsp;&lt;span style=&quot;color: #006dd7;&quot;&gt;&lt;a style=&quot;color: #006dd7;&quot; href=&quot;https://docs.pytorch.org/docs/stable/cuda?utm_source=chatgpt.com&quot;&gt;PyTorch Docs&lt;/a&gt;&lt;a style=&quot;color: #006dd7;&quot; href=&quot;https://discuss.pytorch.org/t/cuda-manual-startup/90370?utm_source=chatgpt.com&quot;&gt;PyTorch Forums+1&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #000000; text-align: start;&quot; data-end=&quot;269&quot; data-start=&quot;220&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li data-end=&quot;269&quot; data-start=&quot;220&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;What torch.cuda.empty_cache() does (clears only the cache)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul style=&quot;list-style-type: disc; color: #000000; text-align: start;&quot; data-end=&quot;548&quot; data-start=&quot;270&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;371&quot; data-start=&quot;270&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Official documentation: it is the function that frees &amp;ldquo;unused memory remaining in the caching allocator&amp;rdquo;. (As of PyTorch 2.8)&amp;nbsp;&lt;span style=&quot;color: #006dd7;&quot;&gt;&lt;a style=&quot;color: #006dd7;&quot; href=&quot;https://docs.pytorch.org/docs/stable/generated/torch.cuda.memory.empty_cache.html&quot;&gt;PyTorch Docs&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;548&quot; data-start=&quot;372&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Documentation says the memory/util APIs operate on the&amp;nbsp;&lt;b&gt;&amp;lsquo;current device&amp;rsquo; when no device is specified&lt;/b&gt;&amp;nbsp;(the memory API descriptions in the same section). From this we can tell&amp;nbsp;empty_cache() also targets the current device.&amp;nbsp;&lt;span style=&quot;color: #006dd7;&quot;&gt;&lt;a style=&quot;color: #006dd7;&quot; href=&quot;https://docs.pytorch.org/docs/stable/generated/torch.cuda.memory.memory_reserved.html?utm_source=chatgpt.com&quot;&gt;PyTorch Docs+2PyTorch Docs+2&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #000000; text-align: start;&quot; data-end=&quot;599&quot; data-start=&quot;550&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li data-end=&quot;599&quot; data-start=&quot;550&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;The default &amp;lsquo;current device&amp;rsquo; is&amp;nbsp;cuda:0&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul style=&quot;list-style-type: disc; color: #000000; text-align: start;&quot; data-end=&quot;698&quot; data-start=&quot;600&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;698&quot; data-start=&quot;600&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Shen Li of the PyTorch team states on the forum that &amp;ldquo;the current device defaults to&amp;nbsp;cuda:0&amp;rdquo;.&amp;nbsp;&lt;span style=&quot;color: #006dd7;&quot;&gt;&lt;a style=&quot;color: #006dd7;&quot; href=&quot;https://discuss.pytorch.org/t/should-local-rank-be-equal-to-torch-cuda-current-device/150873&quot;&gt;PyTorch Forums&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;ol style=&quot;list-style-type: decimal; color: #000000; text-align: start;&quot; data-end=&quot;740&quot; data-start=&quot;700&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li data-end=&quot;740&quot; data-start=&quot;700&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;Each thread keeps its own device state (multithreading caveat)&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul style=&quot;list-style-type: disc; color: #000000; text-align: start;&quot; data-end=&quot;877&quot; data-start=&quot;741&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;877&quot; data-start=&quot;741&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The official NVIDIA blog recommends that whenever you create a new host thread, you&amp;nbsp;&lt;b&gt;must&lt;/b&gt;&amp;nbsp;set that thread&amp;rsquo;s current device with&amp;nbsp;cudaSetDevice() (per-thread device state).&amp;nbsp;&lt;a style=&quot;color: #000000;&quot; href=&quot;https://developer.nvidia.com/blog/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/?utm_source=chatgpt.com&quot;&gt;NVIDIA Developer&lt;/a&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
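The per-thread device state behaves much like Python's own threading.local: a value set on the main thread is invisible to a worker thread, which falls back to the default (for CUDA, device 0). An illustrative sketch of that analogy:

```python
import threading

# Per-thread state, analogous to CUDA's per-thread "current device".
state = threading.local()
state.device = 1  # set on the main thread only

seen = []

def worker():
    # The worker thread has its own empty state, so it falls back to the
    # default, just as an unconfigured CUDA thread defaults to cuda:0.
    seen.append(getattr(state, "device", 0))

t = threading.Thread(target=worker)
t.start()
t.join()
print(seen[0])  # → 0
```

This is why setting the GPU index once on the main thread is not enough: each worker must pin its own current device.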
&lt;p style=&quot;color: #000000; text-align: start;&quot; data-end=&quot;904&quot; data-start=&quot;882&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A &amp;ldquo;summary&amp;rdquo; connecting the evidence above (inference)&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc; color: #000000; text-align: start;&quot; data-end=&quot;1180&quot; data-start=&quot;905&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;1180&quot; data-start=&quot;905&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;The first CUDA call creates the context&lt;/b&gt;&amp;nbsp;(lazy init) &amp;rarr;&amp;nbsp;&lt;b&gt;empty_cache() is also a CUDA call&lt;/b&gt;&amp;nbsp;&amp;rarr; if the thread&amp;nbsp;&lt;b&gt;never set a current device, the default is&amp;nbsp;cuda:0&lt;/b&gt;&amp;nbsp;&amp;rarr; calling&amp;nbsp;empty_cache() in that state&amp;nbsp;&lt;b&gt;creates the GPU0 context in that thread&lt;/b&gt;, so&amp;nbsp;&lt;b&gt;a small amount of memory usage can appear on GPU0&lt;/b&gt;. This inference combines (2)&amp;middot;(3)&amp;middot;(1)&amp;middot;(4).&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Fix:&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; text-align: start; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;=&amp;gt; Empty the cache only on that GPU (thread-safe)&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1757568430722&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Make the target GPU this thread's current device for the call,
# so empty_cache() cannot touch (or initialize) cuda:0.
with torch.cuda.device(self.train_device):
    torch.cuda.empty_cache()&lt;/code&gt;&lt;/pre&gt;</description>
      <category>Pytorch</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/187</guid>
      <comments>https://ga02-ailab.tistory.com/187#entry187comment</comments>
      <pubDate>Thu, 11 Sep 2025 14:33:39 +0900</pubDate>
    </item>
    <item>
      <title>[1] Learning Transferable Visual Models From Natural Language Supervision(CLIP)</title>
      <link>https://ga02-ailab.tistory.com/186</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;[Paper] &lt;a href=&quot;https://arxiv.org/pdf/2103.00020&quot;&gt;https://arxiv.org/pdf/2103.00020&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;[Github] &lt;a href=&quot;https://github.com/OpenAI/CLIP&quot;&gt;https://github.com/OpenAI/CLIP&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1756645024878&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;GitHub - openai/CLIP: CLIP (Contrastive Language-Image Pretraining),  Predict the most relevant text snippet given an image&quot; data-og-description=&quot;CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image - openai/CLIP&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/OpenAI/CLIP&quot; data-og-url=&quot;https://github.com/openai/CLIP&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/cvKVtG/hyZGjdOiIE/kLM9GNBmdREfMayhw5JBb1/img.png?width=1280&amp;amp;height=640&amp;amp;face=0_0_1280_640,https://scrap.kakaocdn.net/dn/8ladR/hyZF997NMT/ugNZLSqUlN5NNOqkjtHvi0/img.png?width=1280&amp;amp;height=640&amp;amp;face=0_0_1280_640&quot;&gt;&lt;a href=&quot;https://github.com/OpenAI/CLIP&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/OpenAI/CLIP&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/cvKVtG/hyZGjdOiIE/kLM9GNBmdREfMayhw5JBb1/img.png?width=1280&amp;amp;height=640&amp;amp;face=0_0_1280_640,https://scrap.kakaocdn.net/dn/8ladR/hyZF997NMT/ugNZLSqUlN5NNOqkjtHvi0/img.png?width=1280&amp;amp;height=640&amp;amp;face=0_0_1280_640');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GitHub - openai/CLIP: CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;CLIP (Contrastive Language-Image Pretraining), Predict the most relevant text snippet given an image - openai/CLIP&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #c1bef9;&quot;&gt;&lt;b&gt;1. Abstract&amp;nbsp;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Problem with prior work: training with only a limited set of classes &amp;rArr; limits generality and usability&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;To address these problems:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Proposes the CLIP model (Contrastive Language-Image Pre-Training)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Builds a large-scale dataset&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Collects 400 million image-text pairs &amp;rArr; secures broad vision-language coverage&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Achieves zero-shot classification performance&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Performs zero-shot image classification using CLIP's text encoder&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Shows strong generalization across various downstream tasks&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;OCR, action recognition, and more&amp;hellip;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #c1bef9;&quot;&gt;&lt;b&gt;2. Approach&amp;nbsp;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;2.1 Natural Language Supervision&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;579&quot; data-origin-height=&quot;215&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bsWsrh/btsQdALeghN/kY812Xtrzw75GWku3Vm2k0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bsWsrh/btsQdALeghN/kY812Xtrzw75GWku3Vm2k0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bsWsrh/btsQdALeghN/kY812Xtrzw75GWku3Vm2k0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbsWsrh%2FbtsQdALeghN%2FkY812Xtrzw75GWku3Vm2k0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;579&quot; height=&quot;215&quot; data-origin-width=&quot;579&quot; data-origin-height=&quot;215&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Core approach: learn perception from the supervision contained in natural language&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Advantages of learning from natural language&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Instead of a label such as &amp;ldquo;dog&amp;rdquo;, the input image is paired with the natural-language sentence that describes it&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Highly scalable &amp;rArr; rather than learning only image representations, it connects representations to language, enabling flexible zero-shot transfer.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;2.2. Creating a Sufficiently Large Dataset&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Builds a new dataset of 400 million image-text pairs&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Uses 500,000 words from English Wikipedia, popular search terms, etc. as queries so the text covers a broad range of visual concepts, with up to 20,000 pairs included per query&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;What is a query?&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;A search keyword used to collect image&amp;ndash;text pairs.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Why queries were used&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Simply collecting random web images yields many images without text, and heavy bias.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;So a list of diverse queries is prepared in advance; for each query: search &amp;rarr; collect the related images and text &amp;rarr; add them to the dataset.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;This way, diverse visual concepts (animals, food, places, human actions, etc.) can be covered evenly&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
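The query-driven collection loop described above can be sketched in a few lines; `search_image_text_pairs` is a hypothetical stand-in for a real web-search API, and the dummy pairs are illustrative only:

```python
# Hypothetical sketch of the query-based collection loop.
queries = ["dog", "pizza", "eiffel tower"]  # visual-concept queries
MAX_PER_QUERY = 20_000                      # paper: up to 20k pairs per query

def search_image_text_pairs(query, limit):
    # Stand-in for a real search backend: returns a few dummy
    # (image, caption) pairs; a real implementation would honor `limit`.
    return [(f"{query}_img_{i}.jpg", f"a photo of a {query}") for i in range(3)]

dataset = []
for q in queries:
    dataset.extend(search_image_text_pairs(q, MAX_PER_QUERY))

print(len(dataset))  # → 9
```

The point of the loop is coverage: every query contributes pairs, so concepts missing from random scraping still enter the dataset.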
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;2.3. Selecting an Efficient Pre-Training Method&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Trained with contrastive learning&lt;/span&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;A batch pairs N images with N text descriptions &amp;rArr; each matched image-text pair is correct, and the remaining N^2 - N pairs are &lt;b&gt;incorrect pairs&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Computes the &lt;b&gt;cosine similarity between image and text embeddings&lt;/b&gt;, maximizing the similarity of correct pairs and minimizing that of incorrect pairs&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&amp;rArr; pulls image and text representations close together in a multi-modal embedding space, so the model ends up acquiring a broad range of visual/linguistic concepts&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
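The two numbered steps map onto the pseudocode in the paper's Figure 3. A minimal numpy sketch, with random unit-norm features standing in for encoder outputs and an illustrative temperature value:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8  # batch size, embedding dimension

# Random unit-norm features stand in for real encoder outputs.
I_e = rng.normal(size=(N, d))
I_e /= np.linalg.norm(I_e, axis=1, keepdims=True)
T_e = rng.normal(size=(N, d))
T_e /= np.linalg.norm(T_e, axis=1, keepdims=True)

temp = np.exp(0.07)          # illustrative learned-temperature value
logits = I_e @ T_e.T * temp  # N x N cosine-similarity logits

def xent_diag(logits, axis):
    # Softmax cross-entropy where the diagonal holds the correct pairs.
    z = logits - logits.max(axis=axis, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=axis, keepdims=True))
    return -np.mean(np.diag(logp))

# Symmetric loss: image-to-text and text-to-image, averaged.
loss = (xent_diag(logits, axis=1) + xent_diag(logits, axis=0)) / 2
print(loss > 0)  # → True
```

Minimizing this loss pushes diagonal (correct-pair) similarities up and off-diagonal similarities down, which is exactly steps 1 and 2 above.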
&lt;p&gt;&lt;figure class=&quot;imagegridblock&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/wduiE/btsQd9GmM93/LCvbMKCK1vG2tLVLlsoW40/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/wduiE/btsQd9GmM93/LCvbMKCK1vG2tLVLlsoW40/img.png&quot; data-origin-width=&quot;397&quot; data-origin-height=&quot;425&quot; data-is-animation=&quot;false&quot; style=&quot;width: 44.0853%; margin-right: 10px;&quot; data-widthpercent=&quot;44.6&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/wduiE/btsQd9GmM93/LCvbMKCK1vG2tLVLlsoW40/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FwduiE%2FbtsQd9GmM93%2FLCvbMKCK1vG2tLVLlsoW40%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;397&quot; height=&quot;425&quot;/&gt;&lt;/span&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/LeqTH/btsQeTXhnZU/1bSZnCFFkK9nU7gvteuyy0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/LeqTH/btsQeTXhnZU/1bSZnCFFkK9nU7gvteuyy0/img.png&quot; data-origin-width=&quot;710&quot; data-origin-height=&quot;612&quot; data-is-animation=&quot;false&quot; style=&quot;width: 54.7519%;&quot; data-widthpercent=&quot;55.4&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/LeqTH/btsQeTXhnZU/1bSZnCFFkK9nU7gvteuyy0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FLeqTH%2FbtsQeTXhnZU%2F1bSZnCFFkK9nU7gvteuyy0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;710&quot; height=&quot;612&quot;/&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;2.4. Choosing and Scaling a Model&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Image Encoder: uses ResNet-50 and ViT as the base architectures&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Text Encoder: uses a Transformer (8 attention heads)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;2.5. Training&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;818&quot; data-origin-height=&quot;299&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/A2jgp/btsQdQG5ppg/snThBM8tVExvHUkEULvPp1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/A2jgp/btsQdQG5ppg/snThBM8tVExvHUkEULvPp1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/A2jgp/btsQdQG5ppg/snThBM8tVExvHUkEULvPp1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FA2jgp%2FbtsQdQG5ppg%2FsnThBM8tVExvHUkEULvPp1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;818&quot; height=&quot;299&quot; data-origin-width=&quot;818&quot; data-origin-height=&quot;299&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;contrastive pretraining&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;optimizer: Adam&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;scheduler: Cosine learning rate scheduler&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;batch size: 32,768&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Training time&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Largest ResNet model, RN50x64: 18 days on 592 GPUs&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Largest ViT model, ViT-L/14: trained for 12 days on 256 GPUs&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;zero-shot prediction&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;A prepared set of text prompts&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Extract features with the Image Encoder and the Text Encoder&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Find the text feature most similar to the extracted image feature&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
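The zero-shot steps above reduce to an argmax over cosine similarities. A toy sketch with hand-made 2-D "features" in place of real encoder outputs (in actual CLIP, the class prompts and the image would be embedded by the text and image encoders):

```python
import numpy as np

# Toy 2-D features standing in for encoder outputs.
text_feats = np.array([[1.0, 0.0],   # e.g. "a photo of a dog"
                       [0.0, 1.0]])  # e.g. "a photo of a cat"
image_feat = np.array([0.9, 0.1])    # resembles the first class

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between the image feature and each class text feature;
# the predicted class is the most similar prompt.
sims = normalize(text_feats) @ normalize(image_feat)
print(int(np.argmax(sims)))  # → 0
```

No classifier head is trained: swapping in a new prompt list changes the label space for free, which is what makes the prediction "zero-shot".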
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #c1bef9; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;3. Experiments&amp;nbsp;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;3.1. Zero-Shot Transfer&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;410&quot; data-origin-height=&quot;446&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bptFIy/btsQeWGsCuO/kQFPCElklRrwtkkCr1ZpXk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bptFIy/btsQeWGsCuO/kQFPCElklRrwtkkCr1ZpXk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bptFIy/btsQeWGsCuO/kQFPCElklRrwtkkCr1ZpXk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbptFIy%2FbtsQeWGsCuO%2FkQFPCElklRrwtkkCr1ZpXk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;410&quot; height=&quot;446&quot; data-origin-width=&quot;410&quot; data-origin-height=&quot;446&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;On 16 of the 27 datasets, CLIP's zero-shot prediction outperforms the linear probe (fine-tuning only the classifier)&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Performance is low on handwriting, fine-grained breed classification, medical images, and satellite images &amp;rArr; this reveals the limits of CLIP's training distribution&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;3.2 Few-Shot Learning&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;404&quot; data-origin-height=&quot;387&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/s3Sin/btsQgr6H6Nf/zhLFcdVltbN4aVqFunSxw1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/s3Sin/btsQgr6H6Nf/zhLFcdVltbN4aVqFunSxw1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/s3Sin/btsQgr6H6Nf/zhLFcdVltbN4aVqFunSxw1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fs3Sin%2FbtsQgr6H6Nf%2FzhLFcdVltbN4aVqFunSxw1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;404&quot; height=&quot;387&quot; data-origin-width=&quot;404&quot; data-origin-height=&quot;387&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;x-axis: number of labeled training examples per class&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;CLIP outperforms the other models in both the zero-shot and few-shot settings&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;3.3 Representation Learning&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;824&quot; data-origin-height=&quot;524&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/RUtyD/btsQd7BJ1W1/k2S9d5iKKKIkbRkhqIG63k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/RUtyD/btsQd7BJ1W1/k2S9d5iKKKIkbRkhqIG63k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/RUtyD/btsQd7BJ1W1/k2S9d5iKKKIkbRkhqIG63k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FRUtyD%2FbtsQd7BJ1W1%2Fk2S9d5iKKKIkbRkhqIG63k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;824&quot; height=&quot;524&quot; data-origin-width=&quot;824&quot; data-origin-height=&quot;524&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;A comparison of linear-probe performance across all models, including CLIP.&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;The two panels average scores over 12 and 27 datasets, respectively.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;At every model scale, CLIP outperforms the other models&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;3.4 Robustness to Natural Distribution Shift&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;755&quot; data-origin-height=&quot;301&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cAl5EJ/btsQfGbZ1IB/OlIswLgbixD73BpuKEVZKk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cAl5EJ/btsQfGbZ1IB/OlIswLgbixD73BpuKEVZKk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cAl5EJ/btsQfGbZ1IB/OlIswLgbixD73BpuKEVZKk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcAl5EJ%2FbtsQfGbZ1IB%2FOlIswLgbixD73BpuKEVZKk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;755&quot; height=&quot;301&quot; data-origin-width=&quot;755&quot; data-origin-height=&quot;301&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Models trained on ImageNet lose much of their accuracy when the distribution shifts even slightly, e.g., in object angle or background&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Because CLIP is trained on a vast 400 million image-text pairs, it is less biased toward particular backgrounds, angles, and the like&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;It also learns core features through language (a cat's ears, tail, whiskers, etc.), which makes it more robust to these distribution shifts&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt; 3.5 Comparison to Human Performance&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;380&quot; data-origin-height=&quot;132&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bcvzn4/btsQgrr5Y0Y/rQugJ7Q55HWkoGLF4d39W0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bcvzn4/btsQgrr5Y0Y/rQugJ7Q55HWkoGLF4d39W0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bcvzn4/btsQgrr5Y0Y/rQugJ7Q55HWkoGLF4d39W0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbcvzn4%2FbtsQgrr5Y0Y%2FrQugJ7Q55HWkoGLF4d39W0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;380&quot; height=&quot;132&quot; data-origin-width=&quot;380&quot; data-origin-height=&quot;132&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Experiment on the 3,669 test images of the Oxford-IIIT Pet dataset (cat&amp;middot;dog breed classification, 37 classes): five people guessed the breed of each image without any reference material&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&amp;ldquo;majority vote&amp;rdquo;: the answer chosen by agreement among the raters (collective judgment)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Human average: 54%; CLIP: 93%&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;One-shot and two-shot humans improve substantially but still fall short of CLIP.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Humans improve from just one or two examples, while CLIP's gains are limited &amp;rArr; CLIP lacks the ability to learn features quickly from a handful of examples&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;background-color: #c1bef9; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;4. Limitations&amp;nbsp;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Strong at general object classification, but weaker than other models at fine-grained classification (telling similar bird species apart, identifying car models) and at predicting concrete attributes (e.g., estimating a person's age in a photo) &amp;rArr; it works only within the biased range of its training data&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Constraints of the model design and training: it cannot generate novel sentences&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;It absorbs societal biases about gender, race, and occupation as-is&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <category>Paper Review/LLM &amp;amp; VLM</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/186</guid>
      <comments>https://ga02-ailab.tistory.com/186#entry186comment</comments>
      <pubDate>Sun, 31 Aug 2025 22:04:03 +0900</pubDate>
    </item>
    <item>
      <title>[Pytorch] Running FLUX.1-Kontext-dev and Qwen-Image-Edit on rtx3090</title>
      <link>https://ga02-ailab.tistory.com/185</link>
      <description>&lt;p&gt;&lt;figure class=&quot;imagegridblock&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cdDMLF/btsPZbK8syV/J4KewBk0hnJgruiMup3ep0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cdDMLF/btsPZbK8syV/J4KewBk0hnJgruiMup3ep0/img.png&quot; data-origin-width=&quot;1418&quot; data-origin-height=&quot;1000&quot; data-is-animation=&quot;false&quot; style=&quot;width: 46.2721%; margin-right: 10px;&quot; data-widthpercent=&quot;46.82&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cdDMLF/btsPZbK8syV/J4KewBk0hnJgruiMup3ep0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcdDMLF%2FbtsPZbK8syV%2FJ4KewBk0hnJgruiMup3ep0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1418&quot; height=&quot;1000&quot;/&gt;&lt;/span&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/4usNW/btsPZYqSF9i/K0qnJTObT7MJigPr5nIl0K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/4usNW/btsPZYqSF9i/K0qnJTObT7MJigPr5nIl0K/img.png&quot; data-origin-width=&quot;1366&quot; data-origin-height=&quot;848&quot; data-is-animation=&quot;false&quot; width=&quot;477&quot; height=&quot;296&quot; data-widthpercent=&quot;53.18&quot; style=&quot;width: 52.5651%;&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/4usNW/btsPZYqSF9i/K0qnJTObT7MJigPr5nIl0K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F4usNW%2FbtsPZYqSF9i%2FK0qnJTObT7MJigPr5nIl0K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1366&quot; 
height=&quot;848&quot;/&gt;&lt;/span&gt;&lt;/div&gt;
  &lt;figcaption&gt;FLUX.1-Kontext-dev (left) &amp;nbsp;Qwen-Image-Edit (right)&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;span style=&quot;color: #8a3db6;&quot;&gt;&lt;b&gt;FLUX.1-Kontext-dev&lt;/b&gt;&lt;/span&gt; and &lt;span style=&quot;color: #8a3db6;&quot;&gt;&lt;b&gt;Qwen-Image-Edit&lt;/b&gt;&lt;/span&gt; are both models with strong image generation and editing capabilities.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;They are correspondingly large.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;FLUX.1-Kontext-dev has about 12 billion parameters, and Qwen-Image-Edit about 20 billion.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;So an RTX 3090 GPU falls far short of the memory needed to run them.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;pre id=&quot;code_1755673943735&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import os
from PIL import Image
import torch

from diffusers import QwenImageEditPipeline

pipeline = QwenImageEditPipeline.from_pretrained(&quot;Qwen/Qwen-Image-Edit&quot;)
print(&quot;pipeline loaded&quot;)
pipeline.to(torch.bfloat16)
pipeline.to(&quot;cuda&quot;)
pipeline.set_progress_bar_config(disable=None)
image = Image.open(&quot;./input.png&quot;).convert(&quot;RGB&quot;)
prompt = &quot;Change the rabbit's color to purple, with a flash light background.&quot;
inputs = {
    &quot;image&quot;: image,
    &quot;prompt&quot;: prompt,
    &quot;generator&quot;: torch.manual_seed(0),
    &quot;true_cfg_scale&quot;: 4.0,
    &quot;negative_prompt&quot;: &quot; &quot;,
    &quot;num_inference_steps&quot;: 50,
}

with torch.inference_mode():
    output = pipeline(**inputs)
    output_image = output.images[0]
    output_image.save(&quot;output_image_edit.png&quot;)
    print(&quot;image saved at&quot;, os.path.abspath(&quot;output_image_edit.png&quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The code above is what Qwen-Image-Edit's official Hugging Face page provides. Most diffusion-family models are run this way (you only need to instantiate a different pipeline).&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Naturally, though, this code does not run on an RTX 3090.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;To run it on an RTX 3090, use the code below.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1755674177188&quot; class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import torch
from PIL import Image
from diffusers import AutoModel, DiffusionPipeline, TorchAoConfig

model_id = &quot;/models/Qwen-Image-Edit&quot;
torch_dtype = torch.bfloat16

# TorchAO int8 weight-only on transformer
quantization_config = TorchAoConfig(&quot;int8wo&quot;)

transformer = AutoModel.from_pretrained(
    model_id,
    subfolder=&quot;transformer&quot;,
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
)
pipe = DiffusionPipeline.from_pretrained(
    model_id, 
    transformer=transformer, 
    torch_dtype=torch_dtype,
)
pipe.enable_model_cpu_offload()


# optional LoRA (works with or without)
pipe.load_lora_weights(&quot;/models/Qwen-Image-Lightning-8steps-V1.1.safetensors&quot;)


prompt = &quot;change the background to cafe.&quot;


generator = torch.Generator(device=&quot;cuda&quot;).manual_seed(42)
image = Image.open(&quot;./input.jpeg&quot;).convert(&quot;RGB&quot;)



# use 8 (or 4) steps if you're using the Lightning LoRA
image = pipe(
    image=image,
    prompt=prompt,
    num_inference_steps=8,
    generator=generator,
).images[0]

image.save(&quot;result.png&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Instead of using QwenImageEditPipeline, declare and load the transformer separately with AutoModel and apply &lt;b&gt;&lt;span style=&quot;color: #ee2323;&quot;&gt;TorchAO&lt;/span&gt;&lt;/b&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: start;&quot;&gt;(TorchAO is an optimization library from the PyTorch team, usable with PyTorch 2.3 and later, that provides &lt;b&gt;module-level optimization (quantization, fusion, pruning, etc.)&lt;/b&gt;.)&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;You can also download the LoRA model that speeds up Qwen-Image-Edit from the link below.&lt;/span&gt;&lt;/p&gt;
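For intuition, the idea behind int8 weight-only quantization (store int8 weights plus a floating-point scale, dequantize at matmul time) can be sketched as follows. This is an illustration of the principle only, not TorchAO's actual implementation:

```python
import numpy as np

def quantize_int8_weight_only(w):
    # Per-tensor symmetric quantization: one fp32 scale for the whole tensor.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def linear_int8wo(x, q, scale):
    # Weight-only: activations stay fp32; weights are dequantized on the fly.
    return x @ (q.astype(np.float32) * scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32)).astype(np.float32)  # a linear layer weight
x = rng.standard_normal((4, 64)).astype(np.float32)   # a batch of activations
q, s = quantize_int8_weight_only(w)
err = np.abs(x @ w - linear_int8wo(x, q, s)).max()
# int8 storage is 4x smaller than fp32, and the output error stays small
```

This is why the transformer (the memory-dominant module) is the part quantized in the code above, while the rest of the pipeline stays in bfloat16.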
&lt;figure id=&quot;og_1755674336935&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;lightx2v/Qwen-Image-Lightning &amp;middot; Hugging Face&quot; data-og-description=&quot;Please refer to Qwen-Image-Lightning github to learn how to use the models. use with diffusers  : make sure to install diffusers from main (pip install git+https://github.com/huggingface/diffusers.git) from diffusers import DiffusionPipeline, FlowMatchE&quot; data-og-host=&quot;huggingface.co&quot; data-og-source-url=&quot;https://huggingface.co/lightx2v/Qwen-Image-Lightning&quot; data-og-url=&quot;https://huggingface.co/lightx2v/Qwen-Image-Lightning&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/cH3RAF/hyZycG81yD/P9NduKVI5SO4KZZwulymQK/img.png?width=1200&amp;amp;height=648&amp;amp;face=0_0_1200_648,https://scrap.kakaocdn.net/dn/PAgAh/hyZDRA0Eo5/MofXv2zBigz6gdcQPgwzb1/img.png?width=1200&amp;amp;height=648&amp;amp;face=0_0_1200_648&quot;&gt;&lt;a href=&quot;https://huggingface.co/lightx2v/Qwen-Image-Lightning&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://huggingface.co/lightx2v/Qwen-Image-Lightning&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/cH3RAF/hyZycG81yD/P9NduKVI5SO4KZZwulymQK/img.png?width=1200&amp;amp;height=648&amp;amp;face=0_0_1200_648,https://scrap.kakaocdn.net/dn/PAgAh/hyZDRA0Eo5/MofXv2zBigz6gdcQPgwzb1/img.png?width=1200&amp;amp;height=648&amp;amp;face=0_0_1200_648');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;lightx2v/Qwen-Image-Lightning &amp;middot; Hugging Face&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Please refer to Qwen-Image-Lightning github to learn how to use the models. use with diffusers  : make sure to install diffusers from main (pip install git+https://github.com/huggingface/diffusers.git) from diffusers import DiffusionPipeline, FlowMatchE&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;huggingface.co&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This brings memory usage down to &lt;b&gt;about 23GB&lt;/b&gt;. (The standard code uses about 42GB when run on an A6000 GPU with enable_model_cpu_offload().)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In my experiments the outputs show no difference either.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #006dd7;&quot;&gt;&lt;b&gt;*FLUX.1-Kontext-dev runs the same way; just change the model path in the code above.*&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;[References]&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://huggingface.co/Qwen/Qwen-Image-Edit/discussions/6&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://huggingface.co/Qwen/Qwen-Image-Edit/discussions/6&lt;/a&gt;&lt;/p&gt;</description>
      <category>Pytorch</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/185</guid>
      <comments>https://ga02-ailab.tistory.com/185#entry185comment</comments>
      <pubDate>Wed, 20 Aug 2025 16:34:44 +0900</pubDate>
    </item>
    <item>
      <title>[12] In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer</title>
      <link>https://ga02-ailab.tistory.com/184</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Paper] &lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/pdf/2504.20690&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/pdf/2504.20690&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Github] &lt;a style=&quot;color: #000000;&quot; href=&quot;https://github.com/River-Zhang/ICEdit.git&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://github.com/River-Zhang/ICEdit.git&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1751196080554&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;GitHub - River-Zhang/ICEdit: Image editing is worth a single LoRA! 0.1% training data for fantastic image editing! Training rele&quot; data-og-description=&quot;Image editing is worth a single LoRA! 0.1% training data for fantastic image editing! Training released! Surpasses GPT-4o in ID persistence! Official ComfyUI workflow release! Only 4GB VRAM is enou...&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/River-Zhang/ICEdit.git&quot; data-og-url=&quot;https://github.com/River-Zhang/ICEdit&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/FytLC/hyZcjUbw4o/fge0MxLkRkmLz2Ib6kC6u0/img.png?width=1200&amp;amp;height=600&amp;amp;face=995_120_1049_180,https://scrap.kakaocdn.net/dn/GBXmE/hyZbwlNQkW/sTMpSDeG44E8WlNyhbnA9K/img.png?width=1200&amp;amp;height=600&amp;amp;face=995_120_1049_180&quot;&gt;&lt;a href=&quot;https://github.com/River-Zhang/ICEdit.git&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/River-Zhang/ICEdit.git&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/FytLC/hyZcjUbw4o/fge0MxLkRkmLz2Ib6kC6u0/img.png?width=1200&amp;amp;height=600&amp;amp;face=995_120_1049_180,https://scrap.kakaocdn.net/dn/GBXmE/hyZbwlNQkW/sTMpSDeG44E8WlNyhbnA9K/img.png?width=1200&amp;amp;height=600&amp;amp;face=995_120_1049_180');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GitHub - River-Zhang/ICEdit: Image editing is worth a single LoRA! 0.1% training data for fantastic image editing! Training rele&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Image editing is worth a single LoRA! 0.1% training data for fantastic image editing! Training released! Surpasses GPT-4o in ID persistence! Official ComfyUI workflow release! Only 4GB VRAM is enou...&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #ffc9af;&quot;&gt;&lt;b&gt;1. Abstract&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;What is instruction-based image editing?&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Letting users modify an image however they want via natural-language prompts&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Current methods face a trade-off between precision and efficiency.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Two approaches exist so far&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Fine-tuning: high quality, but requires substantial compute and massive datasets.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Training-free: relatively fast, but struggles to understand and execute the prompt precisely.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;rArr; To address these issues, the paper presents a method that uses a large-scale Diffusion Transformer to perform prompt-based image editing accurately and efficiently with only minimal data and parameter updates.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #ffc9af;&quot;&gt;&lt;b&gt; 2. Introduction&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;817&quot; data-origin-height=&quot;589&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/pCu05/btsOWFMD5s0/xzYHlYEfmKVdlF2sZob6RK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/pCu05/btsOWFMD5s0/xzYHlYEfmKVdlF2sZob6RK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/pCu05/btsOWFMD5s0/xzYHlYEfmKVdlF2sZob6RK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FpCu05%2FbtsOWFMD5s0%2FxzYHlYEfmKVdlF2sZob6RK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;817&quot; height=&quot;589&quot; data-origin-width=&quot;817&quot; data-origin-height=&quot;589&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;To overcome the drawbacks of the two approaches to instruction-based image editing above, recent work uses Diffusion Transformers (DiT)&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Scalable Generation Fidelity: reference-guided synthesis and identity-preserving editing without auxiliary modules &amp;rArr; very strong generative capability (ex. Flux)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Intrinsic Contextual Awareness: bidirectional interactions between the source and the image to be generated &amp;rArr; can process the source and target images simultaneously&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;rArr; &lt;b&gt;Can DiT's strong generative power and contextual awareness alone achieve both the precision and the efficiency of instruction-based image editing, without architectural changes or large-scale training?&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Two key insights&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Devising in-context prompts&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A prompt describing a pair of images, the source (left) and the edited result (right), so that a single generation simultaneously produces the source-matching image on the left and the edited image on the right&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Executes the prompt while preserving the source image's characteristics, with no extra architectural changes or tuning&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The initial noise strongly affects the generated result&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;They found that certain seed values suit particular prompts well&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;A strategy that improves precision and efficiency at the same time&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;LoRA-MoE Hybrid Tuning:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Fine-tunes only some modules of the large DiT and combines LoRA with an MoE (Mixture of Experts) structure to handle diverse prompts (remove, add, &amp;hellip;)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Keeps task-specialized paths inside the model and dynamically activates the appropriate expert path depending on the input&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Early Filter Inference Time Scaling&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;VLM을 평가자로 활용하여 생성 초반 단계에서 여러 noise seed들에 대한 후보 결과물을 비교 한 뒤 prompt에 가장 잘 부합하는 seed를 고르는 방법&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Because the first few steps are enough to judge whether the prompt is being followed, and only the best seed goes through full generation, the method gains &lt;b&gt;precision and efficiency&lt;/b&gt; at the same time&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #ffc9af;&quot;&gt;&lt;b&gt; 3. Method&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;color: #0593d3; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;3.1. Exploration of DiT&amp;rsquo;s In-context Edit Ability&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;In-Context Generation with Edit Instructions&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A specially designed prompt format&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A diptych with two side-by-side images of the same scene. On the right, the scene is the same as on the left but {prompt}.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The left side holds the original image; the right side is the edited image the model must generate.&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A single prompt lets the model perceive the original and the edit together (which parts to keep, what to change, etc.)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Attention maps confirm that the regions that actually need editing are activated&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;719&quot; data-origin-height=&quot;390&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/naeWp/btsOXOCkBxD/OmaeVbL8hqP4fT3MwpCsIK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/naeWp/btsOXOCkBxD/OmaeVbL8hqP4fT3MwpCsIK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/naeWp/btsOXOCkBxD/OmaeVbL8hqP4fT3MwpCsIK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FnaeWp%2FbtsOXOCkBxD%2FOmaeVbL8hqP4fT3MwpCsIK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;719&quot; height=&quot;390&quot; data-origin-width=&quot;719&quot; data-origin-height=&quot;390&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
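The diptych prompt is a fixed template with the edit instruction spliced into the trailing slot; a minimal sketch (the function name is my own, not from the paper):

```python
def diptych_prompt(edit_instruction: str) -> str:
    """Build the in-context diptych prompt: the left panel is the
    source image, the right panel is the edit the DiT must generate."""
    return (
        "A diptych with two side-by-side images of the same scene. "
        "On the right, the scene is the same as on the left but "
        f"{edit_instruction}."
    )
```

For example, `diptych_prompt("the sky is replaced with a sunset")` yields the full editing prompt fed to the DiT.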
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;In-Context Edit Framework&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Experiments with two architectures (T2I DiT vs. inpainting DiT)&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Based on a Text-to-Image DiT&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Inverts the original image into noise (inverted noise) so the DiT can take it as input, inserts it into the left-side tokens of the prompt above, and generates the right side according to the edit instruction.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;inpainting DiT&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Pre-fills the input canvas with the original image on the left half and an empty mask on the right half, then feeds the same prompt above so that only the masked right half is filled in&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
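The inpainting-DiT setup above amounts to composing a double-width canvas plus a binary mask over the right half; a small numpy sketch under that reading (the exact array layout is my assumption):

```python
import numpy as np

def build_inpaint_canvas(src: np.ndarray):
    """Compose the side-by-side input for the inpainting DiT:
    left half = original image, right half = blank region to fill,
    plus a binary mask marking the right half as editable."""
    h, w, c = src.shape
    canvas = np.zeros((h, 2 * w, c), dtype=src.dtype)
    canvas[:, :w] = src                      # left: original image
    mask = np.zeros((h, 2 * w), dtype=np.uint8)
    mask[:, w:] = 1                          # right: region to inpaint
    return canvas, mask
```

The model then denoises only where `mask == 1`, so the left half acts as an untouched reference.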
&lt;p&gt;&lt;figure class=&quot;imagegridblock&quot;&gt;
  &lt;div class=&quot;image-container&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bIMgZR/btsOYffifmK/e6mpzM4N6Xpn2VRWZJjK4K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bIMgZR/btsOYffifmK/e6mpzM4N6Xpn2VRWZJjK4K/img.png&quot; data-origin-width=&quot;628&quot; data-origin-height=&quot;411&quot; data-is-animation=&quot;false&quot; style=&quot;width: 36.4099%; margin-right: 10px;&quot; data-widthpercent=&quot;36.84&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bIMgZR/btsOYffifmK/e6mpzM4N6Xpn2VRWZJjK4K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbIMgZR%2FbtsOYffifmK%2Fe6mpzM4N6Xpn2VRWZJjK4K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;628&quot; height=&quot;411&quot;/&gt;&lt;/span&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/lzaD9/btsOXhkr5ZB/TptWAhSSkm6Pcly6WjBuwK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/lzaD9/btsOXhkr5ZB/TptWAhSSkm6Pcly6WjBuwK/img.png&quot; data-origin-width=&quot;634&quot; data-origin-height=&quot;242&quot; data-is-animation=&quot;false&quot; style=&quot;width: 62.4274%;&quot; data-widthpercent=&quot;63.16&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/lzaD9/btsOXhkr5ZB/TptWAhSSkm6Pcly6WjBuwK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FlzaD9%2FbtsOXhkr5ZB%2FTptWAhSSkm6Pcly6WjBuwK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;634&quot; height=&quot;242&quot;/&gt;&lt;/span&gt;&lt;/div&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;color: #0593d3; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;3.2. LoRA-MoE Hybrid Fine-tuning&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;LoRA Tuning&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Builds a dataset (~50K samples) and LoRA fine-tunes part of the DiT model &amp;rArr; even this small amount of data greatly improves edit success rate and image quality&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;However, a single LoRA module has limits in improving every edit task (add, removal, modification,&amp;hellip;)&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Add and modification differ in nature and call for different manipulations; a single LoRA shows high failure rates on certain tasks &amp;rArr; combine with MoE (Mixture of Experts)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;Mixture of LoRAs&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The MoE structure places multiple expert networks in parallel and selects the appropriate expert according to input characteristics, so the model can handle diverse input patterns&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Highly efficient, since only the experts needed for the prompt content are activated&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Inserts multiple LoRA modules in parallel into the output layer of DiT's multi-modal attention blocks, yielding several expert LoRAs (multiple LoRA paths only near the output)&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;routing classifier&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;For example, with 4 LoRA experts, the routing module predicts the importance of all 4 experts for each token and uses only the top one (Top-K=1), i.e., a sparse MoE scheme&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;rArr; Raises the model's capacity for diverse edit tasks while keeping efficiency intact&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
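The parallel-LoRA-with-Top-1-routing idea can be sketched in numpy. The dimensions, the zero-initialized up-projections, and the router being a plain per-token linear classifier are my assumptions; in the paper the actual layers sit inside FLUX's attention blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 8, 2, 4                 # hidden dim, LoRA rank, expert count

W = rng.normal(size=(d, d))               # frozen base output projection
A = rng.normal(size=(n_experts, r, d))    # per-expert LoRA down-projections
B = np.zeros((n_experts, d, r))           # per-expert LoRA up-projections (zero init)
Wr = rng.normal(size=(d, n_experts))      # routing classifier weights

def moe_lora_forward(x):
    """x: (tokens, d). Route each token to its Top-1 LoRA expert and
    add that expert's low-rank delta on top of the frozen base path."""
    logits = x @ Wr                       # (tokens, n_experts) routing scores
    expert = logits.argmax(axis=-1)       # Top-K = 1: one expert per token
    out = x @ W.T                         # frozen base path
    for e in range(n_experts):
        sel = expert == e
        if sel.any():                     # apply only the selected expert
            out[sel] += x[sel] @ A[e].T @ B[e].T
    return out, expert
```

With `B` zero-initialized, training starts exactly at the pretrained behavior, which is the usual LoRA convention.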
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;color: #0593d3; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;3.3. Early Filter Inference Time Scaling&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A technique that raises edit quality by improving the inference process itself&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The state of the initial noise strongly affects the final result &amp;rArr; the goal is to manage this early stage intelligently&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Uses a VLM as the evaluator (CLIP in this paper)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;How it works&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Generate images from a variety of seeds&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Evaluate the images after the first few denoising steps with CLIP&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Keep the most promising candidate and discard the remaining candidates, or adjust their scaling&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Efficient, since generation starts in the right direction from the outset&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
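The three steps above reduce to: partially denoise each seed, score the partial result, keep the best. A toy sketch (`denoise_step` and `score_fn` are placeholders for the real diffusion update and the CLIP image-text scorer):

```python
def early_filter(seeds, denoise_step, score_fn, early_steps=4):
    """Partially denoise each candidate seed, score the intermediate
    result, and return the single most promising seed (Top-1 filtering)."""
    best_seed, best_score = None, float("-inf")
    for seed in seeds:
        state = seed
        for _ in range(early_steps):      # only the first few steps
            state = denoise_step(state)
        score = score_fn(state)           # stand-in for the CLIP score
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed
```

Only the returned seed is then run through the full denoising schedule, which is where the efficiency gain comes from.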
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;548&quot; data-origin-height=&quot;502&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dMA8fg/btsOWqoES1W/MsgwsW6BqtAeJ6cKBk5T60/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dMA8fg/btsOWqoES1W/MsgwsW6BqtAeJ6cKBk5T60/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dMA8fg/btsOWqoES1W/MsgwsW6BqtAeJ6cKBk5T60/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdMA8fg%2FbtsOWqoES1W%2FMsgwsW6BqtAeJ6cKBk5T60%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;548&quot; height=&quot;502&quot; data-origin-width=&quot;548&quot; data-origin-height=&quot;502&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #ffc9af;&quot;&gt;&lt;b&gt;4. Experiments&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;Implementation Details&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Uses FLUX.1 Fill as the backbone&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;LoRA fine-tuning dataset&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;MagicBrush (~9K edit pairs) plus a subset of OmniEdit (~40K samples) &amp;rArr; about 50K samples in total&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;Evaluation Settings&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Evaluation datasets&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Emu Edit: GT is provided, so the similarity between the edited output and the GT is quantified with CLIP score, DINO score, L1 loss, etc.&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;MagicBrush: no GT &amp;rArr; GPT-4-based evaluation is introduced&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;GPT-4 is shown the original, the output, and the prompt, and scores whether the edit was carried out properly&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Additionally uses VIE-Score: composed of two parts, SC (semantic consistency) and PQ (perceptual quality)&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;SC measures whether the prompt was carried out and how well regions that should not change are preserved; PQ measures the quality of the image itself&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
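The Emu Edit metrics listed above are standard similarity measures; a sketch of the two basic forms (the real CLIP/DINO scores use embeddings from the respective encoders, which are not reproduced here):

```python
import numpy as np

def l1_distance(img_a, img_b):
    """Pixel-space L1 between the edited output and the GT image."""
    return np.abs(img_a.astype(float) - img_b.astype(float)).mean()

def embedding_score(emb_a, emb_b):
    """Cosine similarity between feature embeddings, the form used by
    CLIP / DINO scores (embeddings would come from those encoders)."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b)
```

Lower L1 and higher cosine similarity both indicate the output is closer to the ground truth.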
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;4.1. Comparisons with State-of-the-Art&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Results on the MagicBrush &amp;amp; Emu Edit datasets&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Achieves performance comparable to or higher than existing models with far fewer parameters&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1082&quot; data-origin-height=&quot;477&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Em27Y/btsOVcY0F5G/YdIakUDh7KarK9H25przdk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Em27Y/btsOVcY0F5G/YdIakUDh7KarK9H25przdk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Em27Y/btsOVcY0F5G/YdIakUDh7KarK9H25przdk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FEm27Y%2FbtsOVcY0F5G%2FYdIakUDh7KarK9H25przdk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1082&quot; height=&quot;477&quot; data-origin-width=&quot;1082&quot; data-origin-height=&quot;477&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The DINO score is a metric used to assess the visual similarity between images&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;572&quot; data-origin-height=&quot;597&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bsukeA/btsOVeWPREM/E1CDHKqvmZRYdpHX9WimVk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bsukeA/btsOVeWPREM/E1CDHKqvmZRYdpHX9WimVk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bsukeA/btsOVeWPREM/E1CDHKqvmZRYdpHX9WimVk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbsukeA%2FbtsOVeWPREM%2FE1CDHKqvmZRYdpHX9WimVk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;572&quot; height=&quot;597&quot; data-origin-width=&quot;572&quot; data-origin-height=&quot;597&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1207&quot; data-origin-height=&quot;682&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/1JpcO/btsOWgsQaoe/MQOo2qFCrzze6BpSHdmGHK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/1JpcO/btsOWgsQaoe/MQOo2qFCrzze6BpSHdmGHK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/1JpcO/btsOWgsQaoe/MQOo2qFCrzze6BpSHdmGHK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F1JpcO%2FbtsOWgsQaoe%2FMQOo2qFCrzze6BpSHdmGHK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1207&quot; height=&quot;682&quot; data-origin-width=&quot;1207&quot; data-origin-height=&quot;682&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;752&quot; data-origin-height=&quot;761&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cbaX7s/btsOWqCaOPc/E3nccTKy4Gf3CKGUHjrrck/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cbaX7s/btsOWqCaOPc/E3nccTKy4Gf3CKGUHjrrck/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cbaX7s/btsOWqCaOPc/E3nccTKy4Gf3CKGUHjrrck/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcbaX7s%2FbtsOWqCaOPc%2FE3nccTKy4Gf3CKGUHjrrck%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;752&quot; height=&quot;761&quot; data-origin-width=&quot;752&quot; data-origin-height=&quot;761&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;</description>
      <category>Paper Review/etc</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/184</guid>
      <comments>https://ga02-ailab.tistory.com/184#entry184comment</comments>
      <pubDate>Sun, 29 Jun 2025 20:30:16 +0900</pubDate>
    </item>
    <item>
      <title>[11] Visual Instruction Tuning (LLaVA: Large Language and Vision Assistant)</title>
      <link>https://ga02-ailab.tistory.com/183</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[paper] &lt;a style=&quot;color: #000000;&quot; href=&quot;https://arxiv.org/pdf/2304.08485&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/pdf/2304.08485&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Github] &lt;a style=&quot;color: #000000;&quot; href=&quot;https://github.com/haotian-liu/LLaVA&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://github.com/haotian-liu/LLaVA&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1746967428009&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;GitHub - haotian-liu/LLaVA: [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyo&quot; data-og-description=&quot;[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond. - haotian-liu/LLaVA&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/haotian-liu/LLaVA&quot; data-og-url=&quot;https://github.com/haotian-liu/LLaVA&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/vMwHd/hyYRoudteR/MaCODz5kkPdhhI0UOVQdBK/img.png?width=1200&amp;amp;height=600&amp;amp;face=984_148_1051_221,https://scrap.kakaocdn.net/dn/bHIlZL/hyYTb1BYN4/61KJ2GKv7Y0EWqjxjwPw60/img.png?width=1200&amp;amp;height=600&amp;amp;face=984_148_1051_221&quot;&gt;&lt;a href=&quot;https://github.com/haotian-liu/LLaVA&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/haotian-liu/LLaVA&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/vMwHd/hyYRoudteR/MaCODz5kkPdhhI0UOVQdBK/img.png?width=1200&amp;amp;height=600&amp;amp;face=984_148_1051_221,https://scrap.kakaocdn.net/dn/bHIlZL/hyYTb1BYN4/61KJ2GKv7Y0EWqjxjwPw60/img.png?width=1200&amp;amp;height=600&amp;amp;face=984_148_1051_221');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GitHub - haotian-liu/LLaVA: [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyo&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond. - haotian-liu/LLaVA&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #c1bef9;&quot;&gt;&lt;b&gt;Abstract&amp;nbsp;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Limitation of existing LLMs: they cannot take images as input and thus struggle with visual information &amp;rArr; a shortage of multi-modal research&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;First to propose using GPT-4 to generate multi-modal language-image instruction-following data&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;LLaVA: Large Language and Vision Assistant&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;LLaVA: trained end-to-end, connecting a vision encoder with an LLM&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Builds a benchmark dataset based on this data&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #c1bef9;&quot;&gt;&lt;b&gt;Introduction&amp;nbsp;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Existing multi-modal vision-and-language instructions&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Language was already used for classification, detection, segmentation, captioning, and so on, but only to describe the image.&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Very limited&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;As in the example below, user instructions cannot be taken as input, so conversing about an image is impossible&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;638&quot; data-origin-height=&quot;515&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bjPsgH/btsNTyUUvEI/zjapUo1dFAjo67QU61jgS0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bjPsgH/btsNTyUUvEI/zjapUo1dFAjo67QU61jgS0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bjPsgH/btsNTyUUvEI/zjapUo1dFAjo67QU61jgS0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbjPsgH%2FbtsNTyUUvEI%2FzjapUo1dFAjo67QU61jgS0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;462&quot; height=&quot;373&quot; data-origin-width=&quot;638&quot; data-origin-height=&quot;515&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Main contributions&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Multi-modal instruction-following data&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Proposes a pipeline that converts data into instruction-following format with ChatGPT and GPT-4, to address the shortage of image-text pair data&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Large multi-modal models&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Proposes an LLM that connects CLIP's visual encoder to LLaMA and is fine-tuned end-to-end on the generated vision-language data.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Multimodal instruction-following benchmark&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Proposes LLaVA-Bench&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #c1bef9;&quot;&gt;&lt;b&gt;GPT-assisted Visual Instruction Data Generation&amp;nbsp;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;CC and LAION, the datasets used in prior multi-modal work, are not in instruction-following format&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;So these datasets are converted into instruction-following format using ChatGPT and GPT-4&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Specific prompts are fed to ChatGPT and GPT-4 as input.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. Images themselves are not used as input; only image captions and bbox values (labels already present in the COCO dataset) are used to generate question and conversation sets (symbolic representation) &amp;rArr; these can be encoded as a sequence an LLM can read&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;717&quot; data-origin-height=&quot;231&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cC4fEw/btsNR3htZ4H/KOM7nXckLuDTkfOnXwDlh0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cC4fEw/btsNR3htZ4H/KOM7nXckLuDTkfOnXwDlh0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cC4fEw/btsNR3htZ4H/KOM7nXckLuDTkfOnXwDlh0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcC4fEw%2FbtsNR3htZ4H%2FKOM7nXckLuDTkfOnXwDlh0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;717&quot; height=&quot;231&quot; data-origin-width=&quot;717&quot; data-origin-height=&quot;231&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;rArr; This information goes into the context slot in step 2&lt;/span&gt;&lt;/p&gt;
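The symbolic-representation step above can be sketched in Python; the caption and bbox fields below are illustrative toy values, not the paper's exact prompt layout:

```python
# Build a text-only "symbolic representation" of an image from its
# captions and bounding boxes, so a text-only LLM can reason about it.
def build_symbolic_context(captions, boxes):
    lines = ["Captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Objects (category, normalized xyxy bbox):")
    lines += [f"- {name}: {bbox}" for name, bbox in boxes]
    return "\n".join(lines)

context = build_symbolic_context(
    captions=["A group of people standing outside of a black vehicle."],
    boxes=[("person", (0.68, 0.24, 0.77, 0.69)),
           ("backpack", (0.38, 0.63, 0.49, 0.93))],
)
print(context)
```

The resulting string plays the role of the context that ChatGPT/GPT-4 sees in place of pixels.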
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Three types of instruction-following data are designed from COCO images using the prompt below&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;623&quot; data-origin-height=&quot;430&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bUdbJg/btsNStteYK0/2Z8sGkGeEo06NSZa46q0Zk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bUdbJg/btsNStteYK0/2Z8sGkGeEo06NSZa46q0Zk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bUdbJg/btsNStteYK0/2Z8sGkGeEo06NSZa46q0Zk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbUdbJg%2FbtsNStteYK0%2F2Z8sGkGeEo06NSZa46q0Zk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;623&quot; height=&quot;430&quot; data-origin-width=&quot;623&quot; data-origin-height=&quot;430&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3. Three types of data are generated in total: conversation, detailed description, and complex reasoning&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;620&quot; data-origin-height=&quot;476&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/MPZoZ/btsNTFztUlr/88pS08PVBDsIrkVA47K6XK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/MPZoZ/btsNTFztUlr/88pS08PVBDsIrkVA47K6XK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/MPZoZ/btsNTFztUlr/88pS08PVBDsIrkVA47K6XK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FMPZoZ%2FbtsNTFztUlr%2F88pS08PVBDsIrkVA47K6XK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;620&quot; height=&quot;476&quot; data-origin-width=&quot;620&quot; data-origin-height=&quot;476&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;conversation &amp;rArr; assistant-human dialogue format; includes QA about things knowable from the image alone: questions about visual elements themselves, such as object types, counts, positions, actions, and relative locations&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;detailed description &amp;rArr; generates a detailed description of the image&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;complex reasoning &amp;rArr; generates QA requiring in-depth reasoning; responses must include rigorous explanations with concrete justifications&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;4. A total of 150K samples are generated&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;430&quot; data-origin-height=&quot;119&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ykCqG/btsNRKJekRw/dl0gl2lXoKzJOFIWG1Czyk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ykCqG/btsNRKJekRw/dl0gl2lXoKzJOFIWG1Czyk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ykCqG/btsNRKJekRw/dl0gl2lXoKzJOFIWG1Czyk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FykCqG%2FbtsNRKJekRw%2Fdl0gl2lXoKzJOFIWG1Czyk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;430&quot; height=&quot;119&quot; data-origin-width=&quot;430&quot; data-origin-height=&quot;119&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #c1bef9;&quot;&gt;&lt;b&gt;Visual Instruction Tuning&amp;nbsp;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;[Architecture]&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;876&quot; data-origin-height=&quot;342&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/yD4IN/btsNSagNKu9/nx8VuL37kCeqaHeQETTGE0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/yD4IN/btsNSagNKu9/nx8VuL37kCeqaHeQETTGE0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/yD4IN/btsNSagNKu9/nx8VuL37kCeqaHeQETTGE0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FyD4IN%2FbtsNSagNKu9%2Fnx8VuL37kCeqaHeQETTGE0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;876&quot; height=&quot;342&quot; data-origin-width=&quot;876&quot; data-origin-height=&quot;342&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;LLM: Vicuna (the best-performing instruction-following model available at the time)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Vision encoder: ViT-L/14, the pretrained CLIP visual encoder (used to turn images into visual features)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;*Added component: a linear layer that projects image features into the word embedding space&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt; &amp;rArr; a very lightweight, cost-effective scheme that enables fast iteration on data-centric experiments&lt;/span&gt;&lt;/p&gt;
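A minimal sketch of that added projection, assuming a CLIP feature width of 1024 and an LLM embedding width of 4096 (numpy stands in for a trainable linear layer; all values are toy):

```python
import numpy as np

# H_v = Z_v @ W: project visual patch features Z_v into the LLM's
# word embedding space with a single trainable matrix W.
rng = np.random.default_rng(0)
d_vision, d_llm, n_patches = 1024, 4096, 256

Z_v = rng.standard_normal((n_patches, d_vision))   # CLIP patch features
W = rng.standard_normal((d_vision, d_llm)) * 0.02  # the added linear layer

H_v = Z_v @ W   # visual tokens, now the same width as word embeddings
print(H_v.shape)  # (256, 4096)
```

These projected visual tokens are simply prepended to the word-embedded text tokens before the LLM.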
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;[Training]&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;decimal&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Generate conversation data for each image&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;212&quot; data-origin-height=&quot;41&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Lbfvq/btsNTz0za4M/2Ue1XG4fii4OKY78V5kel1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Lbfvq/btsNTz0za4M/2Ue1XG4fii4OKY78V5kel1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Lbfvq/btsNTz0za4M/2Ue1XG4fii4OKY78V5kel1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FLbfvq%2FbtsNTz0za4M%2F2Ue1XG4fii4OKY78V5kel1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;212&quot; height=&quot;41&quot; data-origin-width=&quot;212&quot; data-origin-height=&quot;41&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Each answer is treated as the assistant's response&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The t-th instruction is set up as follows&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;717&quot; data-origin-height=&quot;63&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/BmF97/btsNTlnRlIJ/MfzKS3OoXbJrvJw8QUIJlk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/BmF97/btsNTlnRlIJ/MfzKS3OoXbJrvJw8QUIJlk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/BmF97/btsNTlnRlIJ/MfzKS3OoXbJrvJw8QUIJlk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FBmF97%2FbtsNTlnRlIJ%2FMfzKS3OoXbJrvJw8QUIJlk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;717&quot; height=&quot;63&quot; data-origin-width=&quot;717&quot; data-origin-height=&quot;63&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This yields a consistent, unified sequence format&lt;/span&gt;&lt;/p&gt;
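The turn layout above can be sketched as follows; the [image] placeholder stands in for the visual tokens and is not the paper's literal token:

```python
def format_instruction(turn_t, question):
    # The image is presented only with the first instruction;
    # later turns in the same conversation are text-only.
    if turn_t == 1:
        return f"[image]\n{question}"
    return question

dialog = [format_instruction(t + 1, q)
          for t, q in enumerate(["What is in the photo?", "What color is it?"])]
print(dialog)
```

Every training conversation then shares the same shape regardless of how many turns it has.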
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Perform instruction tuning&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Trained with an autoregressive objective&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Predicts the next token from the preceding tokens&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The key difference from previous models is that image features are used alongside the text&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;464&quot; data-origin-height=&quot;60&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bvWWOj/btsNR9IXR4o/2nPUEGo916PjRRnxhB4Bak/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bvWWOj/btsNR9IXR4o/2nPUEGo916PjRRnxhB4Bak/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bvWWOj/btsNR9IXR4o/2nPUEGo916PjRRnxhB4Bak/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbvWWOj%2FbtsNR9IXR4o%2F2nPUEGo916PjRRnxhB4Bak%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;464&quot; height=&quot;60&quot; data-origin-width=&quot;464&quot; data-origin-height=&quot;60&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;rArr; the probability of the answer X_a for a sequence of length L (X_instruct&amp;lt;i: all instruction tokens before the currently predicted token X_i,&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;X_a&amp;lt;i: all answer tokens before the currently predicted token X_i)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3. Loss computation&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;769&quot; data-origin-height=&quot;108&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/c3irrS/btsNR5sQxcp/YV6dLBkzc5YQcawztNJN90/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/c3irrS/btsNR5sQxcp/YV6dLBkzc5YQcawztNJN90/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/c3irrS/btsNR5sQxcp/YV6dLBkzc5YQcawztNJN90/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fc3irrS%2FbtsNR5sQxcp%2FYV6dLBkzc5YQcawztNJN90%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;769&quot; height=&quot;108&quot; data-origin-width=&quot;769&quot; data-origin-height=&quot;108&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Only the tokens in the green region are predicted&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The loss is computed only over these tokens as well&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;4. End-to-end fine-tuning&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The visual encoder is frozen; only the LLM and the projection layer are trained&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
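The answer-only loss masking can be sketched in numpy (toy shapes; per-token cross-entropy is averaged only where the mask marks answer tokens):

```python
import numpy as np

def masked_nll(logits, targets, answer_mask):
    # Softmax cross-entropy per token, then average only over answer
    # positions: instruction/system tokens contribute no loss.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return (nll * answer_mask).sum() / answer_mask.sum()

rng = np.random.default_rng(0)
logits = rng.standard_normal((6, 10))        # 6 tokens, vocab of 10
targets = rng.integers(0, 10, size=6)
mask = np.array([0, 0, 0, 1, 1, 1])          # loss only on the answer span
print(masked_nll(logits, targets, mask))
```

In a real PyTorch run the same effect is usually achieved by setting instruction-token labels to an ignored index in the cross-entropy loss.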
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #c1bef9;&quot;&gt;&lt;b&gt;Experiments&amp;nbsp;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;pretraining hyperparameters&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;GPU: A100*8&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Batch size: 8&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;epoch: 1&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;lr: 2e-3&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;fine-tuning hyperparameters&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;GPU: A100*8&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;batch size: 32&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;epoch: 3&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;lr: 2e-5&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1075&quot; data-origin-height=&quot;583&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bKU7KZ/btsNTnsr7B7/JQx4bu1MJbnpNmQxNa2k60/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bKU7KZ/btsNTnsr7B7/JQx4bu1MJbnpNmQxNa2k60/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bKU7KZ/btsNTnsr7B7/JQx4bu1MJbnpNmQxNa2k60/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbKU7KZ%2FbtsNTnsr7B7%2FJQx4bu1MJbnpNmQxNa2k60%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1075&quot; height=&quot;583&quot; data-origin-width=&quot;1075&quot; data-origin-height=&quot;583&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;rArr; Only LLaVA recognizes that ironing inside a moving car is unusual and reflects this in its answer&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;525&quot; data-origin-height=&quot;493&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cHkyjz/btsNRPqia0g/nZRScYWqQLdoF8PR4n4n6K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cHkyjz/btsNRPqia0g/nZRScYWqQLdoF8PR4n4n6K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cHkyjz/btsNRPqia0g/nZRScYWqQLdoF8PR4n4n6K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcHkyjz%2FbtsNRPqia0g%2FnZRScYWqQLdoF8PR4n4n6K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;525&quot; height=&quot;493&quot; data-origin-width=&quot;525&quot; data-origin-height=&quot;493&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;rArr; Understands that a world map has been made out of chicken&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;712&quot; data-origin-height=&quot;488&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/HgBrW/btsNScecxVf/nxFCT5JIzKkdxJBktDaqeK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/HgBrW/btsNScecxVf/nxFCT5JIzKkdxJBktDaqeK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/HgBrW/btsNScecxVf/nxFCT5JIzKkdxJBktDaqeK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FHgBrW%2FbtsNScecxVf%2FnxFCT5JIzKkdxJBktDaqeK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;712&quot; height=&quot;488&quot; data-origin-width=&quot;712&quot; data-origin-height=&quot;488&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;rArr; Can even recognize and explain a famous artwork rendered humorously&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;Quantitative evaluation&lt;/b&gt;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;30 images are randomly sampled from the COCO dataset&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;GPT-4's answers are taken as ground truth, and GPT-4 provides an overall score (in mean &amp;plusmn; std format)&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;For each question, a human evaluator rates &lt;b&gt;by score or ranking&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The &lt;b&gt;mean and standard deviation (std)&lt;/b&gt; across all questions are computed&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;rarr; these are averaged per model and reported as a single number&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A good indicator of performance consistency across diverse questions&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A small standard deviation means the quality of the model's responses is stable&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Shows strong performance even on datasets it was not trained on&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Performs especially well on complex reasoning&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;721&quot; data-origin-height=&quot;137&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ZuTDM/btsNScyubCP/QlpAVcPYqf79u18ZBxix2k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ZuTDM/btsNScyubCP/QlpAVcPYqf79u18ZBxix2k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ZuTDM/btsNScyubCP/QlpAVcPYqf79u18ZBxix2k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FZuTDM%2FbtsNScyubCP%2FQlpAVcPYqf79u18ZBxix2k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;721&quot; height=&quot;137&quot; data-origin-width=&quot;721&quot; data-origin-height=&quot;137&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;</description>
      <category>Paper Review/etc</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/183</guid>
      <comments>https://ga02-ailab.tistory.com/183#entry183comment</comments>
      <pubDate>Sun, 11 May 2025 21:54:51 +0900</pubDate>
    </item>
    <item>
      <title>[딥러닝 기본지식] batch size가 학습에 미치는 영향 / 적절한 batch size 선택하기</title>
      <link>https://ga02-ailab.tistory.com/182</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;-&lt;span&gt;&amp;nbsp;&lt;/span&gt;batch size가 학습에 미치는 영향&amp;nbsp;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot; data-start=&quot;137&quot; data-end=&quot;171&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;batch size 의 값에 따라 학습 결과는 직접적인 영향을 받게 됩니다. 클때와 작을 때 각각의 장단점은 아래와 같습니다.&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot; data-start=&quot;137&quot; data-end=&quot;171&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot; data-start=&quot;137&quot; data-end=&quot;171&quot;&gt;&lt;span style=&quot;color: #000000; background-color: #dddddd; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;✅&amp;nbsp; &amp;nbsp;When the batch size is large (e.g., 256~1024)&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot; data-start=&quot;172&quot; data-end=&quot;183&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;✔&lt;span&gt;&amp;nbsp;&lt;/span&gt;Pros:&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot; data-start=&quot;184&quot; data-end=&quot;318&quot;&gt;
&lt;li data-start=&quot;184&quot; data-end=&quot;218&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Parallel computation is fully exploited &amp;rarr; higher GPU utilization&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;219&quot; data-end=&quot;274&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Faster training&lt;span&gt;&amp;nbsp;&lt;/span&gt;(many samples processed in a single forward/backward pass)&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;275&quot; data-end=&quot;318&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Stable gradients&lt;span&gt;&amp;nbsp;&lt;/span&gt;(averaging over many samples keeps updates small)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot; data-start=&quot;320&quot; data-end=&quot;331&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;❌&lt;span&gt;&amp;nbsp;&lt;/span&gt;Cons:&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot; data-start=&quot;332&quot; data-end=&quot;496&quot;&gt;
&lt;li data-start=&quot;332&quot; data-end=&quot;396&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Possibly worse generalization&lt;span&gt;&amp;nbsp;&lt;/span&gt;(gradients are so stable that the model risks getting stuck in local minima)&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;397&quot; data-end=&quot;451&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;High memory usage&lt;span&gt;&amp;nbsp;&lt;/span&gt;(large models can hit Out of Memory (OOM))&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;452&quot; data-end=&quot;496&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Reduced data diversity within each batch increases the risk of overfitting&lt;/span&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot; data-start=&quot;503&quot; data-end=&quot;534&quot;&gt;&lt;span style=&quot;background-color: #dddddd; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;✅&amp;nbsp; &amp;nbsp;When the batch size is small (e.g., 2~32)&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot; data-start=&quot;535&quot; data-end=&quot;546&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;✔&lt;span&gt;&amp;nbsp;&lt;/span&gt;Pros:&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot; data-start=&quot;547&quot; data-end=&quot;701&quot;&gt;
&lt;li data-start=&quot;547&quot; data-end=&quot;601&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Possibly better generalization&lt;span&gt;&amp;nbsp;&lt;/span&gt;(noisy gradients help the model learn diverse patterns)&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;602&quot; data-end=&quot;635&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Low memory usage&lt;span&gt;&amp;nbsp;&lt;/span&gt;(even large models can be trained)&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;636&quot; data-end=&quot;701&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Optimization can be more effective&lt;span&gt;&amp;nbsp;&lt;/span&gt;(optimizers such as SGD can learn faster with small batches)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot; data-start=&quot;703&quot; data-end=&quot;714&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;❌&lt;span&gt;&amp;nbsp;&lt;/span&gt;Cons:&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot; data-start=&quot;715&quot; data-end=&quot;841&quot;&gt;
&lt;li data-start=&quot;715&quot; data-end=&quot;757&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Slow training&lt;span&gt;&amp;nbsp;&lt;/span&gt;(frequent updates lower overall efficiency)&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;758&quot; data-end=&quot;796&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;High gradient variance can make training unstable&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;797&quot; data-end=&quot;841&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Batch Normalization may not work properly&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
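When memory forces a small batch but large-batch gradient stability is wanted, gradient accumulation is a common workaround; a minimal numpy sketch with a toy linear model:

```python
import numpy as np

# Accumulate gradients over several micro-batches before one update,
# emulating an effective batch of micro_bs * accum_steps.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((32, 4)), rng.standard_normal(32)
w = np.zeros(4)
micro_bs, accum_steps, lr = 8, 4, 0.1

grad = np.zeros_like(w)
for step in range(accum_steps):
    xb = X[step * micro_bs:(step + 1) * micro_bs]
    yb = y[step * micro_bs:(step + 1) * micro_bs]
    err = xb @ w - yb                 # linear-regression residual
    grad += xb.T @ err / len(X)       # scale as if one big batch
w -= lr * grad                        # single optimizer step
print(w.shape)
```

With identical scaling, the accumulated gradient matches the full-batch gradient exactly, so the update behaves like a batch of 32 while only 8 samples ever sit in memory at once.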
&lt;hr contenteditable=&quot;false&quot; data-ke-style=&quot;style5&quot; data-ke-type=&quot;horizontalRule&quot; /&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;- How to choose an appropriate batch size&amp;nbsp;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot; data-start=&quot;1619&quot; data-end=&quot;1646&quot;&gt;&lt;span style=&quot;background-color: #dddddd; font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;✅&amp;nbsp;&lt;span&gt;&amp;nbsp;&lt;/span&gt;Commonly recommended values&amp;nbsp; &amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot; data-start=&quot;1647&quot; data-end=&quot;1784&quot;&gt;
&lt;li data-start=&quot;1647&quot; data-end=&quot;1681&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Empirically,&lt;span&gt;&amp;nbsp;&lt;/span&gt;128~512 is generally a suitable range&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;1682&quot; data-end=&quot;1706&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Small datasets:&lt;span&gt;&amp;nbsp;&lt;/span&gt;16~64&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;1707&quot; data-end=&quot;1736&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Medium-sized datasets:&lt;span&gt;&amp;nbsp;&lt;/span&gt;128~256&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;1737&quot; data-end=&quot;1784&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Large datasets (ImageNet, LLaMA, etc.):&lt;span&gt;&amp;nbsp;&lt;/span&gt;512~1024&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Note: when the batch size is large, &lt;b&gt;the number of samples that end up in the last batch can negatively affect performance&lt;/b&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;For example, if the batch size is 300 but the last batch contains only 6 samples, performance suffers for the following reasons.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot; data-start=&quot;182&quot; data-end=&quot;223&quot;&gt;&lt;span style=&quot;background-color: #dddddd;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1️⃣&amp;nbsp; &amp;nbsp;Batch Normalization (BN) issues&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot; data-start=&quot;224&quot; data-end=&quot;431&quot;&gt;
&lt;li data-start=&quot;224&quot; data-end=&quot;284&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;When the batch becomes too small,&lt;span&gt;&amp;nbsp;&lt;/span&gt;Batch Normalization's mean/variance estimates become unstable&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;285&quot; data-end=&quot;371&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;BN usually behaves stably when&lt;span&gt;&amp;nbsp;&lt;/span&gt;batch_size is at least 16~32; with a batch of 6, the statistics may not be computed reliably&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;372&quot; data-end=&quot;431&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;As a result,&lt;span&gt;&amp;nbsp;&lt;/span&gt;scaling that differs from the previous batches is applied, and the model may perform a misleading update&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
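&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This instability is easy to make concrete with a small numpy sketch (synthetic data, not an actual BN layer): the batch mean estimated from 6 samples scatters far more around the true mean than one estimated from 300 samples.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic "activations" with true mean 0 and std 1.
population = rng.normal(loc=0.0, scale=1.0, size=100_000)

def mean_spread(batch_size, trials=1000):
    # Std of the per-batch mean estimate across many random batches:
    # this is the sampling noise BN's batch statistics are exposed to.
    means = [rng.choice(population, size=batch_size).mean() for _ in range(trials)]
    return float(np.std(means))

spread_6 = mean_spread(6)
spread_300 = mean_spread(300)
print(spread_6, spread_300)  # the 6-sample estimate is roughly 7x noisier
```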
&lt;p data-ke-size=&quot;size16&quot; data-start=&quot;808&quot; data-end=&quot;851&quot;&gt;&lt;span style=&quot;background-color: #dddddd;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2️⃣&amp;nbsp; &amp;nbsp;Optimizer gradient update issues&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot; data-start=&quot;852&quot; data-end=&quot;1128&quot;&gt;
&lt;li data-start=&quot;852&quot; data-end=&quot;921&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Because the last batch holds only 6 samples, its&lt;span&gt;&amp;nbsp;&lt;/span&gt;gradient update can be much smaller than those of the preceding batches (300)&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;922&quot; data-end=&quot;1051&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;With a high-momentum optimizer (e.g., SGD with momentum=0.9),&lt;span&gt;&amp;nbsp;&lt;/span&gt;the large gradients from the previous batches (300) persist and the small gradient from the last batch (6) is almost ignored&lt;/span&gt;&lt;/li&gt;
&lt;li data-start=&quot;1052&quot; data-end=&quot;1128&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In particular,&lt;span&gt;&amp;nbsp;&lt;/span&gt;adaptive optimizers such as Adam and RMSProp can become increasingly unstable as the batch size shrinks&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The simplest fix here is to&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;add the drop_last option to the DataLoader so the small final batch is dropped&lt;/b&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1743408727156&quot; class=&quot;routeros&quot; style=&quot;background-color: #f8f8f8; color: #383a42;&quot; data-ke-type=&quot;codeblock&quot; data-ke-language=&quot;bash&quot;&gt;&lt;code&gt;dataloader = DataLoader(dataset, batch_size=300, shuffle=True, drop_last=True)&lt;/code&gt;&lt;/pre&gt;
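&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;As a plain-Python sanity check on the example above (906 is a hypothetical dataset size chosen so that the remainder is 6), drop_last simply discards the short final batch:&lt;/span&gt;&lt;/p&gt;

```python
def batch_sizes(n_samples, batch_size, drop_last=False):
    # Sizes of the batches a DataLoader-style iterator would yield.
    full, remainder = divmod(n_samples, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)
    return sizes

print(batch_sizes(906, 300))                  # [300, 300, 300, 6]
print(batch_sizes(906, 300, drop_last=True))  # [300, 300, 300]
```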
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>AI Research/Deep Learning</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/182</guid>
      <comments>https://ga02-ailab.tistory.com/182#entry182comment</comments>
      <pubDate>Mon, 21 Apr 2025 09:53:08 +0900</pubDate>
    </item>
    <item>
      <title>[Pytorch] The indirect impact of num_workers on performance</title>
      <link>https://ga02-ailab.tistory.com/181</link>
<description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;While training a classification model, the validation accuracy looked like this:&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;n epoch =&amp;gt;&amp;nbsp; &quot;class A&quot; Acc: 98.21%,&amp;nbsp; &amp;nbsp;&quot;class B&quot; Acc: 98.67%&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;n+1 epoch =&amp;gt;&amp;nbsp; &amp;nbsp;&quot;class A&quot; Acc: 28.96%,&amp;nbsp; &amp;nbsp;&quot;class B&quot; Acc: 99.88%&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The accuracy of the class with less data dropped sharply.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Looking into the cause, this is &lt;b&gt;a phenomenon that can occur when num_workers is set too high&lt;/b&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;- What is num_workers?&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;num_workers is &lt;b&gt;the number of subprocesses (workers) PyTorch's DataLoader uses to load data&lt;/b&gt;.&lt;/span&gt;&lt;br /&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;By default the DataLoader loads data batch by batch; increasing num_workers lets several processes &lt;b&gt;load data in parallel&lt;/b&gt;, which can speed up training.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;- num_workers=0 vs num_workers&amp;gt;0&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;403&quot; data-start=&quot;263&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;num_workers=0:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-end=&quot;403&quot; data-start=&quot;286&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;332&quot; data-start=&quot;286&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Data is loaded in a single process (the main process loads it directly)&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;366&quot; data-start=&quot;335&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Data is loaded&lt;b&gt; sequentially, one item at a time &amp;rarr; slow&lt;/b&gt;&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;403&quot; data-start=&quot;369&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Good for debugging (avoids multiprocessing-related errors)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li data-end=&quot;540&quot; data-start=&quot;405&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;num_workers&amp;gt;0:&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-end=&quot;540&quot; data-start=&quot;428&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;460&quot; data-start=&quot;428&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Data is loaded in parallel using multiprocessing&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;501&quot; data-start=&quot;463&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Multiple workers load data simultaneously &amp;rarr; faster&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;540&quot; data-start=&quot;504&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;However, too many workers can overload the CPU and cause I/O bottlenecks&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;- Why can num_workers indirectly affect performance?&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;1. Problems with data shuffling&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-end=&quot;443&quot; data-start=&quot;211&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;294&quot; data-start=&quot;211&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;If num_workers is too high or misconfigured, the DataLoader's shuffle=True may not work as intended.&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;350&quot; data-start=&quot;295&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In that case, samples of a particular class (the one with more data) can arrive in long consecutive runs, biasing the model toward it.&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;443&quot; data-start=&quot;351&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Especially when one class is scarce, if the data is not shuffled evenly the model overfits to the majority class, and the accuracy of the class with&amp;nbsp;less data is likely to drop.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
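&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;One common mitigation for per-worker randomness problems is to give every worker its own seed derived from a base seed; the sketch below illustrates the idea with Python's random module (the helper name is illustrative, not PyTorch's actual worker_init_fn API).&lt;/span&gt;&lt;/p&gt;

```python
import random

def make_worker_rng(base_seed, worker_id):
    # Each worker derives its own RNG from (base_seed + worker_id),
    # which is the idea behind per-worker seeding in PyTorch's worker_init_fn.
    return random.Random(base_seed + worker_id)

base_seed = 42
# Four "workers", each drawing its own sample order from its own stream.
streams = [make_worker_rng(base_seed, w).sample(range(100), 5) for w in range(4)]
print(streams)
```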
&lt;p data-end=&quot;579&quot; data-start=&quot;544&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;2. Data loading delayed by I/O bottlenecks&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-end=&quot;784&quot; data-start=&quot;580&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;784&quot; data-start=&quot;717&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;If some batches lose data or the DataLoader behaves erratically, model performance can become inconsistent&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-end=&quot;990&quot; data-start=&quot;961&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;3. Possible loss of data from specific classes&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-end=&quot;1198&quot; data-start=&quot;991&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li data-end=&quot;1063&quot; data-start=&quot;991&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;With a high num_workers, some data may be lost while being loaded in parallel across processes.&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;1132&quot; data-start=&quot;1064&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Especially with drop_last=True, samples of the class with less&amp;nbsp;data may end up excluded from some batches.&lt;/span&gt;&lt;/li&gt;
&lt;li data-end=&quot;1198&quot; data-start=&quot;1133&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In that case the model does not see enough of the scarce data, and the accuracy of the class with fewer samples can drop.&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;hr contenteditable=&quot;false&quot; data-ke-type=&quot;horizontalRule&quot; data-ke-style=&quot;style5&quot; /&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A value between 4 and 8 is usually considered appropriate,&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;and in practice, changing num_workers from 16 to 8 made training noticeably more stable and reduced the fluctuation in validation accuracy.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Pytorch</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/181</guid>
      <comments>https://ga02-ailab.tistory.com/181#entry181comment</comments>
      <pubDate>Mon, 31 Mar 2025 17:13:54 +0900</pubDate>
    </item>
    <item>
      <title>[OpenCV] error: (-215:Assertion failed) cn == CV_MAT_CN(dstType) &amp;amp;&amp;amp; ddepth &amp;gt;= sdepth in function 'getLinearFilter'</title>
      <link>https://ga02-ailab.tistory.com/180</link>
<description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- Full error message&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1739167329266&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;cv2.error: OpenCV(4.9.0) /io/opencv/modules/imgproc/src/filter.simd.hpp:3231: error: (-215:Assertion failed) cn == CV_MAT_CN(dstType) &amp;amp;&amp;amp; ddepth &amp;gt;= sdepth in function 'getLinearFilter'&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;This error can occur when using OpenCV's seamlessClone function.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;It is raised because the mask and src images passed to seamlessClone() have different data types.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Making the two data types match fixes it.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- Solution&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1739167360716&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;mask = mask.astype(np.uint8)&lt;/code&gt;&lt;/pre&gt;
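&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;A minimal numpy sketch of the fix (dummy arrays standing in for real images): seamlessClone expects 8-bit inputs, so a mask that ended up with a wider dtype must be cast before the call.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

src = np.zeros((100, 100, 3), dtype=np.uint8)    # 8-bit BGR image, as cv2.imread returns
mask = np.full((100, 100), 255, dtype=np.int32)  # a mask built with a non-8-bit dtype

# Align the mask's dtype with the 8-bit source before calling cv2.seamlessClone:
mask = mask.astype(np.uint8)
print(src.dtype, mask.dtype)
```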
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Error Note</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/180</guid>
      <comments>https://ga02-ailab.tistory.com/180#entry180comment</comments>
      <pubDate>Mon, 17 Mar 2025 10:26:50 +0900</pubDate>
    </item>
    <item>
      <title>Linux: changing an ssh account password when you have forgotten it</title>
      <link>https://ga02-ailab.tistory.com/179</link>
<description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;If you have forgotten the password for an ssh connection to a Linux server, there is a way to change it without knowing the old password.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;1. First, log in with the server's administrator account.&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;2. Run the command below.&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1738292937538&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;sudo passwd {your_username}&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;3. When the prompt below appears, enter the new password.&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;716&quot; data-origin-height=&quot;118&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/FWkro/btsL25Vp7Ee/IQRRa4QUWJJz8qfj04KDh1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/FWkro/btsL25Vp7Ee/IQRRa4QUWJJz8qfj04KDh1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/FWkro/btsL25Vp7Ee/IQRRa4QUWJJz8qfj04KDh1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FFWkro%2FbtsL25Vp7Ee%2FIQRRa4QUWJJz8qfj04KDh1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;716&quot; height=&quot;118&quot; data-origin-width=&quot;716&quot; data-origin-height=&quot;118&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Linux</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/179</guid>
      <comments>https://ga02-ailab.tistory.com/179#entry179comment</comments>
      <pubDate>Thu, 27 Feb 2025 09:54:11 +0900</pubDate>
    </item>
    <item>
      <title>ValueError: assignment destination is read-only</title>
      <link>https://ga02-ailab.tistory.com/178</link>
<description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Full error message&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1737615622330&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;ValueError: assignment destination is read-only&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This error occurs when you try to modify the values of a read-only numpy array.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Solution&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Change the array's write flag with &lt;b&gt;ori_np.setflags(write=1)&lt;/b&gt; as shown below.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1737615726342&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;ori_np=ori_np.copy()
ori_np.setflags(write=1)&lt;/code&gt;&lt;/pre&gt;
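&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A quick reproduction and fix with a synthetic buffer: arrays created by np.frombuffer over immutable bytes are read-only, and copying the array makes it writable again.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

# Arrays built on an external buffer (np.frombuffer, some image loaders) are read-only.
ori_np = np.frombuffer(b"\x01\x02\x03\x04", dtype=np.uint8)
print(ori_np.flags.writeable)  # False: `ori_np[0] = 9` here raises ValueError

# Copying yields an array that owns its data, so the write flag can be enabled.
ori_np = ori_np.copy()
ori_np.setflags(write=1)
ori_np[0] = 9
print(ori_np)  # [9 2 3 4]
```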
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Error Note</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/178</guid>
      <comments>https://ga02-ailab.tistory.com/178#entry178comment</comments>
      <pubDate>Mon, 10 Feb 2025 15:03:34 +0900</pubDate>
    </item>
    <item>
      <title>[lmdb] Writing and reading lmdb files</title>
      <link>https://ga02-ailab.tistory.com/177</link>
<description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;lmdb is an embedded key-value database engine, so it can write and read data quickly with low memory usage. Thanks to these advantages, it is often used in deep-learning training on large datasets.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;- Creating an lmdb file&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&amp;nbsp;Suppose we want to store an image path and the image data in lmdb.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;First, open the database.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735018503640&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import lmdb

env = lmdb.open('image_folder_path', map_size=int(1e12))&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Both keys and values are stored as bytes, so we use encode().&amp;nbsp; Since the image will also be stored as bytes, it is read with open() rather than with OpenCV or PIL.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735018782911&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;with env.begin(write=True) as txn:
	
    # encode the keys
    path_key = &quot;path&quot;.encode()
    img_key = &quot;img_data&quot;.encode()
    
    # encode the values
    path_value = &quot;my_image.jpg&quot;.encode()
    with open(&quot;my_image.jpg&quot;, 'rb') as f:
    	img_value = f.read()&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Now put them into the database.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735018927509&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# still inside the `with env.begin(write=True) as txn:` block
txn.put(path_key, path_value)
txn.put(img_key, img_value)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Once this completes, the following two files are created at the specified path.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;396&quot; data-origin-height=&quot;66&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bmjB7y/btsLwHzKiza/ISVpKcqHfvKcBWgKSsKmkK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bmjB7y/btsLwHzKiza/ISVpKcqHfvKcBWgKSsKmkK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bmjB7y/btsLwHzKiza/ISVpKcqHfvKcBWgKSsKmkK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbmjB7y%2FbtsLwHzKiza%2FISVpKcqHfvKcBWgKSsKmkK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;396&quot; height=&quot;66&quot; data-origin-width=&quot;396&quot; data-origin-height=&quot;66&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;- Reading an lmdb file&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;First, open the database.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735019113168&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;env = lmdb.open(lmdb_path, readonly=True)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Now let's read the data back.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735019267810&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;with env.begin() as txn:
	
    path_key = 'path'.encode()
    path_value = txn.get(path_key).decode()
    
    img_key = &quot;img_data&quot;.encode()
    img_value = txn.get(img_key)  # raw image bytes: do not decode()&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Now img_value holds the raw bytes of the image. To convert it back into a numpy array, do the following.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1735019414209&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;img_array = np.frombuffer(img_value, dtype=np.uint8)
img = cv2.imdecode(img_array, cv2.IMREAD_COLOR)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Etc</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/177</guid>
      <comments>https://ga02-ailab.tistory.com/177#entry177comment</comments>
      <pubDate>Thu, 23 Jan 2025 15:57:58 +0900</pubDate>
    </item>
    <item>
      <title>[Deep Learning Basics] How Text-to-Image works (Multi-Modal AI)</title>
      <link>https://ga02-ailab.tistory.com/176</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9; font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;[ What is Multi-Modal? ]&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In the past, most models produced only images from image inputs and only text from text inputs. These days, research is very active on models that describe an input image in text, or generate an image from a text description.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;Using several kinds of data at once rather than a single kind is called &quot;Multi-Modal&quot;&lt;/b&gt;&lt;/span&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #c1bef9; color: #000000;&quot;&gt;&lt;b&gt;[ How a single model understands Multi-Modal data ]&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Most image networks are built on CNNs, while the representative architecture for text is the Transformer. So how can a single model understand both images and text?&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Put simply, &lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;images and texts are each trained to have embedding vectors in a latent space, and image-text pairs with similar meanings are linked within that latent space.&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A quick example: let's train a model that outputs an apple image for the input &quot;apple&quot; and a banana image for the input &quot;banana&quot;.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;First, the texts &quot;apple&quot; and &quot;banana&quot; are trained to sit in one latent space, and the apple and banana images are trained to sit in another latent space. Then the model is trained to connect similar items across the two latent spaces.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
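&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The matching step can be sketched with toy embeddings (hand-made vectors, not the output of a trained encoder): each text is paired with the image whose embedding has the highest cosine similarity, which is essentially how CLIP-style models connect the two spaces.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity: 1.0 means the embeddings point in the same direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: in a real model these come from the text/image encoders.
text_emb = {"apple": np.array([1.0, 0.1]), "banana": np.array([0.1, 1.0])}
img_emb = {"apple.jpg": np.array([0.9, 0.2]), "banana.jpg": np.array([0.2, 0.8])}

def best_image(text):
    # Pick the image whose embedding is most similar to the text embedding.
    v = text_emb[text]
    return max(img_emb, key=lambda name: cosine(v, img_emb[name]))

print(best_image("apple"))   # apple.jpg
print(best_image("banana"))  # banana.jpg
```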
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;구체적인 구조는 Text-to-Image 의 가장 유명한 모델로 꼽히는 DALLE2의 구조도를 이용해 설명하겠습니다.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;[ DALLE2 Architecture ]&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;다운로드.jpg&quot; data-origin-width=&quot;801&quot; data-origin-height=&quot;287&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/TBRmm/btsK4YXA9re/Lz5nMmwN217kF3hkCT3gjK/img.jpg&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/TBRmm/btsK4YXA9re/Lz5nMmwN217kF3hkCT3gjK/img.jpg&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/TBRmm/btsK4YXA9re/Lz5nMmwN217kF3hkCT3gjK/img.jpg&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FTBRmm%2FbtsK4YXA9re%2FLz5nMmwN217kF3hkCT3gjK%2Fimg.jpg&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;801&quot; height=&quot;287&quot; data-filename=&quot;다운로드.jpg&quot; data-origin-width=&quot;801&quot; data-origin-height=&quot;287&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;step 1) The prompt passes through the text encoder, which extracts a text embedding.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;step 2) The text embedding is fed to the prior, which outputs the image embedding judged to be its matching pair.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;step 3) The image embedding and the prompt are fed into the decoder, which generates the image.&lt;/span&gt;&lt;/p&gt;
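The three steps above can be sketched as a tiny pipeline. Every name here (dalle2_generate and the stand-in callables) is a hypothetical illustration of the data flow, not the real DALLE2 API:

```python
def dalle2_generate(prompt, text_encoder, prior, decoder):
    """Sketch of DALLE2's three stages; the callables are stand-ins."""
    text_emb = text_encoder(prompt)     # step 1: extract a text embedding
    image_emb = prior(text_emb)         # step 2: prior maps text embedding -> image embedding
    image = decoder(image_emb, prompt)  # step 3: decoder generates the image
    return image

# Toy stand-ins just to make the data flow visible
demo = dalle2_generate(
    "an apple",
    text_encoder=lambda p: f"text_emb({p})",
    prior=lambda t: f"image_emb({t})",
    decoder=lambda i, p: f"image({i}, {p})",
)
print(demo)  # -> image(image_emb(text_emb(an apple)), an apple)
```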
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc9af; color: #000000;&quot;&gt;&lt;b&gt;- Text Encoder&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;DALLE2 uses OpenAI's CLIP model. It is trained on a dataset of 400 million text-image pairs so that&amp;nbsp;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: start;&quot;&gt;similar items can be connected across the two latent spaces. Below is the architecture of CLIP.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1280&quot; data-origin-height=&quot;484&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/SxiCD/btsK5zb3aQR/1BpCKUDdL8jD9ki0YWPsxK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/SxiCD/btsK5zb3aQR/1BpCKUDdL8jD9ki0YWPsxK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/SxiCD/btsK5zb3aQR/1BpCKUDdL8jD9ki0YWPsxK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FSxiCD%2FbtsK5zb3aQR%2F1BpCKUDdL8jD9ki0YWPsxK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1280&quot; height=&quot;484&quot; data-origin-width=&quot;1280&quot; data-origin-height=&quot;484&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;CLIP takes texts and images as input and embeds each of them. Cosine similarities are then computed between the text-image combinations drawn from the 400 million pairs; the combination with the highest similarity is the related text-image pair. So, given a text you can obtain a similar image embedding, and given an image you can obtain a similar text embedding.&lt;/span&gt;&lt;/p&gt;
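The matching step can be sketched as a small cosine-similarity computation. This is a toy example with hand-made 2-D embeddings, not CLIP's real encoders or its 400M-pair training:

```python
import numpy as np

def match_pairs(text_embs, image_embs):
    """For each text embedding, return the index of the most similar image.

    Cosine similarity = dot product of L2-normalized vectors.
    """
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    i = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = t @ i.T  # sims[a, b] = cosine similarity of text a and image b
    return sims.argmax(axis=1)

# Toy embeddings: text 0 should match image 0, text 1 should match image 1
texts = np.array([[1.0, 0.1], [0.1, 1.0]])
images = np.array([[0.9, 0.0], [0.0, 0.9]])
print(match_pairs(texts, images))  # -> [0 1]
```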
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: start;&quot;&gt;- Prior&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: start;&quot;&gt;This is where the diffusion model lives. A text is fed into CLIP to obtain its similar image embedding, and those image embeddings are used to train the diffusion model:&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: start;&quot;&gt;noise is added to each image embedding.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: start;&quot;&gt;Just as in ordinary diffusion, the model is trained to predict that noise. Once training is done, it can tell which noise was added to an embedding, and can therefore also generate the embedding as it was before the noise was added.&amp;nbsp;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: start;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
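The noise-prediction objective described above can be sketched as one toy training step. A single linear layer stands in for the prior network; this only illustrates the objective, not DALLE2's actual prior:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
prior = nn.Linear(8, 8)  # stand-in for the prior network
opt = torch.optim.SGD(prior.parameters(), lr=0.1)

image_emb = torch.randn(16, 8)    # toy "CLIP image embeddings"
noise = torch.randn_like(image_emb)
noisy_emb = image_emb + noise     # add noise to the embeddings

pred_noise = prior(noisy_emb)     # the model predicts the added noise
loss = nn.functional.mse_loss(pred_noise, noise)
loss.backward()
opt.step()                        # one gradient step toward predicting the noise
print(loss.item())
```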
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: start;&quot;&gt;- Decoder&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;The decoder is largely based on GLIDE. GLIDE guides the denoising process with a text embedding; in DALLE2, the image embedding extracted by the prior is fed in together with it.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9; color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: start;&quot;&gt;[ Summary ]&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;A multimodal model consists of a text encoder + a prior + an image decoder.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;Image generation is possible with CLIP and GLIDE alone, but with the prior the model can reflect all of the information contained in the prompt, which yields higher-quality images. Here, too, we can see how powerful diffusion is.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>AI Research/Deep Learning</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/176</guid>
      <comments>https://ga02-ailab.tistory.com/176#entry176comment</comments>
      <pubDate>Thu, 2 Jan 2025 10:28:00 +0900</pubDate>
    </item>
    <item>
      <title>Missing key(s) in state_dict: &amp;quot;clip_model.vision_tower.vision_model.embeddings.position_ids&amp;quot;.</title>
      <link>https://ga02-ailab.tistory.com/175</link>
      <description>&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199; color: #000000; text-align: start;&quot;&gt;- 전체 에러문구&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1731460693855&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;Missing key(s) in state_dict: &quot;clip_model.vision_tower.vision_model.embeddings.position_ids&quot;.&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #000000; text-align: start;&quot;&gt;This error occurs when you load a pretrained model with load_state_dict and the model architecture does not match the checkpoint.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;b&gt;&lt;span style=&quot;background-color: #f6e199; color: #000000; text-align: start;&quot;&gt;- Solution&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #000000; text-align: start;&quot;&gt;Add &lt;b&gt;strict=False&lt;/b&gt; as a parameter to the load_state_dict function.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1731460756523&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;model.load_state_dict(ckpt, strict=False)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #000000; text-align: start;&quot;&gt;With this parameter, only the weights that can actually be matched are loaded flexibly when restoring the model.&lt;/span&gt;&lt;/p&gt;
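A minimal sketch of the mismatch and the fix. The position_ids buffer below is a stand-in for the real CLIP buffer named in the error; the two module classes are hypothetical:

```python
import torch
import torch.nn as nn

class WithBuffer(nn.Module):
    """Model that expects a position_ids buffer in its state_dict."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)
        self.register_buffer("position_ids", torch.arange(4))

class WithoutBuffer(nn.Module):
    """Model whose checkpoint was saved without that buffer."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 4)

ckpt = WithoutBuffer().state_dict()  # checkpoint lacks position_ids
model = WithBuffer()

# With strict=True (the default) this would raise:
#   Missing key(s) in state_dict: "position_ids".
# With strict=False the mismatched keys are reported instead of raising.
result = model.load_state_dict(ckpt, strict=False)
print(result.missing_keys)  # -> ['position_ids']
```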
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Error Note</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/175</guid>
      <comments>https://ga02-ailab.tistory.com/175#entry175comment</comments>
      <pubDate>Thu, 12 Dec 2024 10:44:36 +0900</pubDate>
    </item>
    <item>
      <title>TypeError: load_checkpoint_and_dispatch() got an unexpected keyword argument 'force_hooks'</title>
      <link>https://ga02-ailab.tistory.com/174</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- 전체 에러 문구&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1731382332077&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;TypeError: load_checkpoint_and_dispatch() got an unexpected keyword argument 'force_hooks'&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;This problem is caused by an outdated version of accelerate.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;My current version is 0.21.0.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;You can check yours with pip list | grep accelerate.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;blob&quot; data-origin-width=&quot;1260&quot; data-origin-height=&quot;76&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/NO6nv/btsKF7TynRR/cuLTC4j27wNCO7VsGe16tK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/NO6nv/btsKF7TynRR/cuLTC4j27wNCO7VsGe16tK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/NO6nv/btsKF7TynRR/cuLTC4j27wNCO7VsGe16tK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FNO6nv%2FbtsKF7TynRR%2FcuLTC4j27wNCO7VsGe16tK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;662&quot; height=&quot;40&quot; data-filename=&quot;blob&quot; data-origin-width=&quot;1260&quot; data-origin-height=&quot;76&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Upgrading to &lt;b&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;accelerate==0.30.0&amp;nbsp; &lt;/span&gt;&lt;/b&gt;resolves the issue.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1731382456341&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;pip install accelerate==0.30.0&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Error Note</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/174</guid>
      <comments>https://ga02-ailab.tistory.com/174#entry174comment</comments>
      <pubDate>Fri, 29 Nov 2024 12:37:14 +0900</pubDate>
    </item>
    <item>
      <title>sd-x2-latent-upscaler 모델로 image upscale 하기</title>
      <link>https://ga02-ailab.tistory.com/173</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;diffusion을 기반으로 하는 image&amp;nbsp;upscale 모델이 아주 많은데요.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;많이 쓰이는 모델 중 하나인 sd-x2-latent-upscaler를 이용해 image&amp;nbsp;upscale을 진행해보겠습니다.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;전체 코드는 아래와 같습니다.&lt;/p&gt;
&lt;pre id=&quot;code_1731310059066&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;from diffusers import StableDiffusionLatentUpscalePipeline
from PIL import Image
import torch


upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(&quot;stabilityai/sd-x2-latent-upscaler&quot;, torch_dtype=torch.float16)
upscaler.to(&quot;cuda&quot;)

prompt = &quot;(photorealistic:1.4), best quality&quot;
generator = torch.manual_seed(42)

low_res_image = Image.open(&quot;image.png&quot;)
upscaled_image = upscaler(
    prompt=prompt,
    image=low_res_image,
    num_inference_steps=20,
    guidance_scale=0,
    generator=generator,
).images[0]


upscaled_image.save(&quot;result.png&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;We set num_inference_steps to just 20.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- before&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1018&quot; data-origin-height=&quot;1020&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Bh0ZM/btsKDCVhFaN/FIBipvcRrlJsFak9xLR3Rk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Bh0ZM/btsKDCVhFaN/FIBipvcRrlJsFak9xLR3Rk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Bh0ZM/btsKDCVhFaN/FIBipvcRrlJsFak9xLR3Rk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FBh0ZM%2FbtsKDCVhFaN%2FFIBipvcRrlJsFak9xLR3Rk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;494&quot; height=&quot;495&quot; data-origin-width=&quot;1018&quot; data-origin-height=&quot;1020&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;- after&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1392&quot; data-origin-height=&quot;1390&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bpxdxK/btsKDsL8a55/skcV7lnJpJX5JpdMhhP2U0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bpxdxK/btsKDsL8a55/skcV7lnJpJX5JpdMhhP2U0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bpxdxK/btsKDsL8a55/skcV7lnJpJX5JpdMhhP2U0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbpxdxK%2FbtsKDsL8a55%2FskcV7lnJpJX5JpdMhhP2U0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1392&quot; height=&quot;1390&quot; data-origin-width=&quot;1392&quot; data-origin-height=&quot;1390&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The image was upscaled well, with no loss of the original quality.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Etc</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/173</guid>
      <comments>https://ga02-ailab.tistory.com/173#entry173comment</comments>
      <pubDate>Thu, 21 Nov 2024 10:22:56 +0900</pubDate>
    </item>
    <item>
      <title>safetensor 모델을 diffusers에서 사용 가능하게 변경하기</title>
      <link>https://ga02-ailab.tistory.com/172</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;diffusion을 사용하시는 분들이라면&amp;nbsp;&lt;a href=&quot;https://civitai.com/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt; civitai&lt;/a&gt;에서 다양한 모델을 다운받아 사용하실텐데요.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;civitai에서는 모델을 .safetensors의 형태로 제공합니다. 하지만 diffusers에서는 scheduler, text_encoder, tokenizer, unet, vae 가 각각 다른 폴더에 저장되어 있는 파일 구조를 원합니다.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;이를 위해 diffusers에서 제공하는 convert_original_stable_diffusion_to_diffusers.py를 사용하면 됩니다. 코드는 아래 github에서 제공하고 있습니다.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;a href=&quot;https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1729561979685&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;diffusers/scripts/convert_original_stable_diffusion_to_diffusers.py at main &amp;middot; huggingface/diffusers&quot; data-og-description=&quot;  Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX. - huggingface/diffusers&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py&quot; data-og-url=&quot;https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/bn34aw/hyXlMv8AYA/nKaAgBO5bXPcFk3cH0d6T0/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600,https://scrap.kakaocdn.net/dn/bNUHCv/hyXlWr0Jlx/y15ZWcPZHfdx9C8nzAXTOk/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600&quot;&gt;&lt;a href=&quot;https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/bn34aw/hyXlMv8AYA/nKaAgBO5bXPcFk3cH0d6T0/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600,https://scrap.kakaocdn.net/dn/bNUHCv/hyXlWr0Jlx/y15ZWcPZHfdx9C8nzAXTOk/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;diffusers/scripts/convert_original_stable_diffusion_to_diffusers.py at main &amp;middot; huggingface/diffusers&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;  Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX. - huggingface/diffusers&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;You can run it as follows.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1729562024743&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;python scripts/convert_original_stable_diffusion_to_diffusers.py \
    --checkpoint_path sd_xl_base_1.0.safetensors \
    --dump_path ./output_dir \
    --pipeline_class_name StableDiffusionXLPipeline \
    --from_safetensors&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;After running the script, you can confirm that the sd_xl_base_1.0.safetensors file has been split in the output_dir folder into&amp;nbsp;&lt;span style=&quot;text-align: start;&quot;&gt;scheduler, text_encoder, tokenizer, unet, and vae.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Etc</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/172</guid>
      <comments>https://ga02-ailab.tistory.com/172#entry172comment</comments>
      <pubDate>Mon, 11 Nov 2024 11:48:50 +0900</pubDate>
    </item>
    <item>
      <title>[4] IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models</title>
      <link>https://ga02-ailab.tistory.com/171</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Paper]&lt;/span&gt; &lt;a href=&quot;https://arxiv.org/pdf/2308.06721&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/pdf/2308.06721&lt;/a&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Github]&lt;/span&gt; &lt;a href=&quot;https://github.com/tencent-ailab/IP-Adapter&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://github.com/tencent-ailab/IP-Adapter&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1728886131753&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;GitHub - tencent-ailab/IP-Adapter: The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to &quot; data-og-description=&quot;The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with image prompt. - GitHub - tencent-ailab/IP-Adapter: The image prompt adapter is des...&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/tencent-ailab/IP-Adapter&quot; data-og-url=&quot;https://github.com/tencent-ailab/IP-Adapter&quot; data-og-image=&quot;&quot;&gt;&lt;a href=&quot;https://github.com/tencent-ailab/IP-Adapter&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/tencent-ailab/IP-Adapter&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url();&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GitHub - tencent-ailab/IP-Adapter: The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with image prompt. - GitHub - tencent-ailab/IP-Adapter: The image prompt adapter is des...&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;i&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Abstract&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Many text-to-image diffusion models that generate high-fidelity images have been released recently. But because they involve intricate prompt engineering, it is hard to produce the desired image from a text prompt alone. &lt;span style=&quot;text-align: start;&quot;&gt;Is there something that could replace the text prompt? This paper offers a phrase:&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;&quot;an image is worth a thousand words&quot;.&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The idea is to use the image itself as the prompt. Such attempts existed before, but they required large-scale computing resources and were incompatible with other base models, text prompts, and structural controls.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This paper improves on these issues and proposes the &lt;b&gt;IP-Adapter&lt;/b&gt;. More details follow in the introduction.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;i&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. Introduction&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1418&quot; data-origin-height=&quot;518&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cbvtCb/btsJ5x6Jehj/Y8xMin6JCGkrv4SDQnVegK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cbvtCb/btsJ5x6Jehj/Y8xMin6JCGkrv4SDQnVegK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cbvtCb/btsJ5x6Jehj/Y8xMin6JCGkrv4SDQnVegK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcbvtCb%2FbtsJ5x6Jehj%2FY8xMin6JCGkrv4SDQnVegK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1418&quot; height=&quot;518&quot; data-origin-width=&quot;1418&quot; data-origin-height=&quot;518&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Images can express far more than text. To address the limitations of text, this paper proposes using an image prompt. As mentioned above, this has already been tried in earlier work. Representative examples are &lt;span style=&quot;background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;&lt;a href=&quot;https://huggingface.co/lambdalabs/sd-image-variations-diffusers&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;SD Image Variations&lt;/a&gt; and &lt;a href=&quot;https://huggingface.co/stabilityai/stable-diffusion-2-1-unclip&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Stable unCLIP&lt;/a&gt;. Both studies showed that directly fine-tuning a text-conditioned diffusion model on image embeddings is effective for using image prompts. &lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;background-color: #fdfdfd; text-align: start;&quot;&gt;하지만 아주 명확한 단점도 존재했는데요,&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;background-color: #fdfdfd; text-align: start;&quot;&gt;첫째, 기존 모델에서 text prompt로 이미지를 생성하는 기능을 제거하고 fine-tuning하기 위해 대규모 &lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: start;&quot;&gt;computing resource가 필요합니다. 둘째, image prompt모델은 다른 &lt;span style=&quot;background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;text-to-image base 모델에서 파생된 다른 커스텀 모델로 transfer 할 수 없기 때문에 &lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; text-align: start;&quot;&gt;재사용이 불가능합니다. &lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;fine-tuning이 갖는 단점으로 인해 일부 연구는 &lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;fine-tuning을 제거하고 text-encoder를 image-encoder로 대체하는 방식을 선택했습니다. 하지만,,, 이 방법 또한 문제가 있습니다.&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Because it accepts only an image prompt as input, image and text cannot be used together, which degrades the quality of the generated images.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This paper argues that the problems above stem from the cross-attention module of text-to-image diffusion models. The key and value projection weights in the cross-attention layers of a pretrained diffusion model are trained to adapt text features. While this can align image features with text features, it risks missing image-specific information and leads to only coarse-grained controllable generation with the reference image.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;To address these issues, this paper proposes the &lt;b&gt;IP-Adapter&lt;/b&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;IP-Adapter adopts a &lt;b&gt;decoupled cross-attention mechanism&lt;/b&gt; for image features and text features. Next to every cross-attention layer in the diffusion UNet, a new cross-attention layer dedicated to image features is added. During training, only these new layers are updated while the original UNet stays frozen. With only 22M parameters, the adapter is &lt;b&gt;very lightweight and efficient&lt;/b&gt;. It remains compatible with text prompts and reportedly performs well. With the proposed IP-Adapter, diverse images can easily be generated, as shown in the figure above.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2. Method&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1356&quot; data-origin-height=&quot;690&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/blVqIa/btsJ43SxiV2/ekhmPbfo8MOukC1Y8oOBoK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/blVqIa/btsJ43SxiV2/ekhmPbfo8MOukC1Y8oOBoK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/blVqIa/btsJ43SxiV2/ekhmPbfo8MOukC1Y8oOBoK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FblVqIa%2FbtsJ43SxiV2%2FekhmPbfo8MOukC1Y8oOBoK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;736&quot; height=&quot;375&quot; data-origin-width=&quot;1356&quot; data-origin-height=&quot;690&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.1 Preliminaries&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;Diffusion models consist of two processes, forward and reverse. If you are curious how diffusion works, see &lt;a href=&quot;https://ga02-ailab.tistory.com/130&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;here&lt;/a&gt;!&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.2 Image Prompt Adapter&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;The image prompt adapter in this paper is designed to let a pretrained text-to-image diffusion model generate images from an image prompt. Existing adapters struggle to match the performance of fine-tuned image prompt models or models trained from scratch. Most of them simply feed features concatenated with the text into frozen cross-attention layers, which prevents capturing fine-grained features from the image prompt.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;To avoid this problem, the paper adopts decoupled cross-attention, in which image features are embedded through newly added cross-attention layers. As the architecture diagram above shows, IP-Adapter consists of two parts:&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;- image encoder: extracts image features from the image prompt&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;- adapted modules: decoupled cross-attention layers that embed the image features into the pretrained text-to-image diffusion model&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;2.2.1 Image Encoder&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;Like most prior methods, this paper uses a pretrained CLIP image encoder to extract image features from the image prompt. The resulting global image embedding is well aligned with image captions and represents the content and style of the image. CLIP stays frozen during training.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;To effectively decompose the global image embedding, a small projection network maps it to a sequence of features of length N, with the same dimension as the text features. The projection network consists of a linear layer and Layer Normalization.&lt;/span&gt;&lt;/p&gt;
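The projection step above can be sketched as follows. This is a minimal numpy sketch with assumed, illustrative dimensions (a 1024-d CLIP image embedding, N = 4 tokens, 768-d text features as in SD v1.5); it is not the paper's actual implementation, and all names here are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
clip_dim, n_tokens, text_dim = 1024, 4, 768  # assumed sizes for illustration

# linear layer weights (randomly initialized here just for the sketch)
W = rng.normal(scale=0.02, size=(clip_dim, n_tokens * text_dim))
b = np.zeros(n_tokens * text_dim)

def layer_norm(x, eps=1e-5):
    # normalize each token over its feature dimension
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def project(image_embed):
    # linear layer, reshape into N tokens, then LayerNorm per token
    tokens = (image_embed @ W + b).reshape(n_tokens, text_dim)
    return layer_norm(tokens)

image_embed = rng.normal(size=(clip_dim,))  # stand-in for a CLIP global embedding
tokens = project(image_embed)
print(tokens.shape)  # (4, 768): N tokens in the text-feature dimension
```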
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;2.2.2 Decoupled Cross-Attention&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;The image features are integrated into the pretrained UNet via decoupled cross-attention. In the original SD model, text features from the CLIP text encoder are fed into the UNet through its cross-attention layers. Given query features &lt;i&gt;Z&lt;/i&gt; and text features &lt;i&gt;c_t&lt;/i&gt;, the cross-attention output &lt;i&gt;Z'&lt;/i&gt; is defined as follows.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;740&quot; data-origin-height=&quot;98&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bunDGz/btsJ5AvHYNv/sH0V8Mogj1GzfRAkwoWUaK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bunDGz/btsJ5AvHYNv/sH0V8Mogj1GzfRAkwoWUaK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bunDGz/btsJ5AvHYNv/sH0V8Mogj1GzfRAkwoWUaK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbunDGz%2FbtsJ5AvHYNv%2FsH0V8Mogj1GzfRAkwoWUaK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;453&quot; height=&quot;60&quot; data-origin-width=&quot;740&quot; data-origin-height=&quot;98&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;Q = ZW_q, K = c_tW_k, V = c_tW_v&lt;/i&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;W_q, W_k, W_v =&amp;gt; weight matrices &lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;The quickest way to inject image features would be to concatenate them with the text features and feed both into the cross-attention layers, but this is not effective. Instead, the paper proposes the decoupled cross-attention mechanism: separate cross-attention layers for the text features and the image features. Concretely, a new cross-attention layer is added for each cross-attention layer in the original UNet. Given image features &lt;i&gt;c_i&lt;/i&gt;, the output &lt;i&gt;Z''&lt;/i&gt; of the new cross-attention is given below.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;740&quot; data-origin-height=&quot;92&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/INbvG/btsJ5eT5UgW/HhlmjyIKrio2lDXqgsn01k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/INbvG/btsJ5eT5UgW/HhlmjyIKrio2lDXqgsn01k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/INbvG/btsJ5eT5UgW/HhlmjyIKrio2lDXqgsn01k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FINbvG%2FbtsJ5eT5UgW%2FHhlmjyIKrio2lDXqgsn01k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;467&quot; height=&quot;58&quot; data-origin-width=&quot;740&quot; data-origin-height=&quot;92&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;Q = ZW_q, K' = c_iW_k', V' = c_iW_v'&lt;/i&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;W_k', W_v' =&amp;gt; weight matrices&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;As a result, only &lt;i&gt;W_k'&lt;/i&gt; and &lt;i&gt;W_v'&lt;/i&gt; need to be added to each cross-attention layer. For fast convergence they are initialized from &lt;i&gt;W_k&lt;/i&gt; and &lt;i&gt;W_v&lt;/i&gt;. The output of the image cross-attention is then added to the output of the text cross-attention, so the final decoupled cross-attention is given below.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;924&quot; data-origin-height=&quot;134&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bwVYQ1/btsJ6o9k6x7/wocUvkF2lsvn3BZUliQek1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bwVYQ1/btsJ6o9k6x7/wocUvkF2lsvn3BZUliQek1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bwVYQ1/btsJ6o9k6x7/wocUvkF2lsvn3BZUliQek1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbwVYQ1%2FbtsJ6o9k6x7%2FwocUvkF2lsvn3BZUliQek1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;502&quot; height=&quot;73&quot; data-origin-width=&quot;924&quot; data-origin-height=&quot;134&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;Here, only &lt;i&gt;W_k'&lt;/i&gt; and &lt;i&gt;W_v'&lt;/i&gt; are trained.&lt;/span&gt;&lt;/p&gt;
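The decoupled cross-attention above can be sketched roughly as follows. This is a minimal single-head numpy sketch with illustrative shapes (the real model applies this per UNet layer with multi-head attention); the variable names and dimensions are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                       # attention feature dim (illustrative)
n_q, n_t, n_i = 16, 77, 4    # query tokens, text tokens, image tokens

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(d)) @ V

Z   = rng.normal(size=(n_q, d))   # query features from the UNet
c_t = rng.normal(size=(n_t, d))   # text features
c_i = rng.normal(size=(n_i, d))   # image features (after projection)

W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
# new key/value projections for the image branch, initialized from W_k, W_v
W_k2, W_v2 = W_k.copy(), W_v.copy()

Z_text  = attention(Z @ W_q, c_t @ W_k,  c_t @ W_v)    # frozen original branch
Z_image = attention(Z @ W_q, c_i @ W_k2, c_i @ W_v2)   # trainable new branch
Z_new = Z_text + Z_image                                # decoupled cross-attention
print(Z_new.shape)  # (16, 64)
```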
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;2.2.3 Training and Inference&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;During training, the pretrained diffusion model is frozen and only the IP-Adapter is trained. The IP-Adapter is trained on image-text pairs with the same objective as the original SD.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;618&quot; data-origin-height=&quot;44&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bpdCeE/btsJ5BgWHh8/VA8aRmVRSJraBmnabb9lQ0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bpdCeE/btsJ5BgWHh8/VA8aRmVRSJraBmnabb9lQ0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bpdCeE/btsJ5BgWHh8/VA8aRmVRSJraBmnabb9lQ0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbpdCeE%2FbtsJ5BgWHh8%2FVA8aRmVRSJraBmnabb9lQ0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;449&quot; height=&quot;32&quot; data-origin-width=&quot;618&quot; data-origin-height=&quot;44&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;To enable classifier-free guidance at inference time, the image condition is also randomly dropped during training.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;724&quot; data-origin-height=&quot;44&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/6c0IY/btsJ6fLthJ9/I08evIr9aqautO0z1CO471/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/6c0IY/btsJ6fLthJ9/I08evIr9aqautO0z1CO471/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/6c0IY/btsJ6fLthJ9/I08evIr9aqautO0z1CO471/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F6c0IY%2FbtsJ6fLthJ9%2FI08evIr9aqautO0z1CO471%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;477&quot; height=&quot;29&quot; data-origin-width=&quot;724&quot; data-origin-height=&quot;44&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;If the image condition is dropped, the CLIP image embedding is simply set to zero.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Since text cross-attention and image cross-attention are decoupled, the image condition can also be weighted independently at inference time.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;740&quot; data-origin-height=&quot;44&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/QV6A5/btsJ6F388CX/vfDHK0H6ixoFwImXBy5wB0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/QV6A5/btsJ6F388CX/vfDHK0H6ixoFwImXBy5wB0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/QV6A5/btsJ6F388CX/vfDHK0H6ixoFwImXBy5wB0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FQV6A5%2FbtsJ6F388CX%2FvfDHK0H6ixoFwImXBy5wB0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;505&quot; height=&quot;30&quot; data-origin-width=&quot;740&quot; data-origin-height=&quot;44&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In the equation above, &lt;span style=&quot;background-color: #ffffff; text-align: start;&quot;&gt;&amp;lambda; is a weight factor; when it is 0, the model reduces to the original text-to-image diffusion model.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
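The λ-weighted combination can be illustrated with a minimal numpy sketch (assumed shapes, single-head attention, names of my own choosing). Setting the weight factor to 0 recovers exactly the text-only cross-attention output.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # illustrative feature dim

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attn(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(d)) @ V

Q = rng.normal(size=(16, d))                              # projected queries
K_t, V_t = rng.normal(size=(77, d)), rng.normal(size=(77, d))  # text branch
K_i, V_i = rng.normal(size=(4, d)), rng.normal(size=(4, d))    # image branch

def decoupled(lam):
    # weighted sum of the text branch and the image branch
    return attn(Q, K_t, V_t) + lam * attn(Q, K_i, V_i)

text_only = attn(Q, K_t, V_t)  # output of the original text cross-attention
```

With lam = 0 the image branch contributes nothing, matching the statement that the model then behaves like the original text-to-image diffusion model.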
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt;3. Experiment&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt;3.1 Experimental Setup&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; text-align: start;&quot;&gt;3.1.1 Training Data&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;LAION-2B [42] and COYO-700M =&amp;gt; about 10 million text-image pairs&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;3.1.2 &lt;span style=&quot;background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;Implementation Details&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #fdfdfd; color: #000000; text-align: start;&quot;&gt;Based on SD v1.5, using OpenCLIP ViT-H/14 as the image encoder&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1508&quot; data-origin-height=&quot;1584&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/GMOb5/btsJ4vPExJn/QTPoGCc4qbJ0dbT9ml0qC0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/GMOb5/btsJ4vPExJn/QTPoGCc4qbJ0dbT9ml0qC0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/GMOb5/btsJ4vPExJn/QTPoGCc4qbJ0dbT9ml0qC0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FGMOb5%2FbtsJ4vPExJn%2FQTPoGCc4qbJ0dbT9ml0qC0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1508&quot; height=&quot;1584&quot; data-origin-width=&quot;1508&quot; data-origin-height=&quot;1584&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1508&quot; data-origin-height=&quot;996&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/xMvkp/btsJ4bKOqcb/erJaWUWK8f9bztZZoOvtk1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/xMvkp/btsJ4bKOqcb/erJaWUWK8f9bztZZoOvtk1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/xMvkp/btsJ4bKOqcb/erJaWUWK8f9bztZZoOvtk1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FxMvkp%2FbtsJ4bKOqcb%2FerJaWUWK8f9bztZZoOvtk1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1508&quot; height=&quot;996&quot; data-origin-width=&quot;1508&quot; data-origin-height=&quot;996&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1508&quot; data-origin-height=&quot;612&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bIdh7I/btsJ5xeSrH4/ZW8obemcEgLhYCLZunkmG1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bIdh7I/btsJ5xeSrH4/ZW8obemcEgLhYCLZunkmG1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bIdh7I/btsJ5xeSrH4/ZW8obemcEgLhYCLZunkmG1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbIdh7I%2FbtsJ5xeSrH4%2FZW8obemcEgLhYCLZunkmG1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1508&quot; height=&quot;612&quot; data-origin-width=&quot;1508&quot; data-origin-height=&quot;612&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
</description>
      <category>Paper Review/Diffusion Personalization</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/171</guid>
      <comments>https://ga02-ailab.tistory.com/171#entry171comment</comments>
      <pubDate>Thu, 24 Oct 2024 10:12:26 +0900</pubDate>
    </item>
    <item>
      <title>[Docker] Dockerfile 작성시 TimeZone 설정하기</title>
      <link>https://ga02-ailab.tistory.com/170</link>
<description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Here is how to set the time zone when writing a Dockerfile.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;To use the South Korean date and time, set it to&lt;span&gt;&amp;nbsp;&lt;/span&gt;&lt;b&gt;Asia/Seoul&lt;/b&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1728525573517&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;ENV TZ=Asia/Seoul&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Some libraries cannot be installed, or throw errors, unless a time zone is set, so if you use Docker it is well worth setting!&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;color: #333333; text-align: start;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;For any other Dockerfile syntax you may need, refer to the reference below.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1728525599054&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;FROM &amp;lt;image&amp;gt;:&amp;lt;tag&amp;gt;  ## docker image to pull from Docker Hub
e.g.) FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

RUN &amp;lt;command&amp;gt;  ## command to execute
e.g.) RUN apt update

WORKDIR &amp;lt;directory path&amp;gt;  ## switch the working directory (later RUN, COPY, etc. run relative to it)
e.g.) WORKDIR /work_dir

COPY &amp;lt;src dir&amp;gt; &amp;lt;dst dir&amp;gt;  ## folder to copy and its destination path
e.g.) COPY diffusers my/diffusers&lt;/code&gt;&lt;/pre&gt;
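&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Combining the directives above with the ENV TZ setting, a minimal complete Dockerfile could look like the sketch below (the base image tag and paths are only illustrative, and DEBIAN_FRONTEND=noninteractive is a common companion setting I am adding, not part of the post):&lt;/span&gt;&lt;/p&gt;

```dockerfile
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-devel

## Set the time zone so tzdata-dependent packages install without an interactive prompt
ENV TZ=Asia/Seoul
ENV DEBIAN_FRONTEND=noninteractive

RUN apt update

WORKDIR /work_dir
COPY diffusers my/diffusers
```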
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Docker</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/170</guid>
      <comments>https://ga02-ailab.tistory.com/170#entry170comment</comments>
      <pubDate>Thu, 10 Oct 2024 11:00:06 +0900</pubDate>
    </item>
    <item>
      <title>ncclInvalidArgument: Invalid value for an argument.</title>
      <link>https://ga02-ailab.tistory.com/169</link>
<description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The following error can occur in PyTorch distributed-processing code.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1724312261312&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;  File &quot;/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py&quot;, line 47, in wrapper
    return func(*args, **kwargs)
  File &quot;/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py&quot;, line 2806, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, invalid argument, NCCL version 2.14.3
ncclInvalidArgument: Invalid value for an argument.
Last error:
Invalid config blocking attribute value -2147483648&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The error was caused by two different versions of nvidia-nccl being installed at the same time.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Solution]&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;First, list the installed nvidia-nccl packages with the command below.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1724312352906&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;pip list | grep nccl&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In my case, two versions were installed: 2.14.3 and 2.18.1.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-filename=&quot;blob&quot; data-origin-width=&quot;1202&quot; data-origin-height=&quot;100&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/IH9C0/btsKpTiYWYM/th9c5792ZbviXfoCP7QeP1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/IH9C0/btsKpTiYWYM/th9c5792ZbviXfoCP7QeP1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/IH9C0/btsKpTiYWYM/th9c5792ZbviXfoCP7QeP1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FIH9C0%2FbtsKpTiYWYM%2Fth9c5792ZbviXfoCP7QeP1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1202&quot; height=&quot;100&quot; data-filename=&quot;blob&quot; data-origin-width=&quot;1202&quot; data-origin-height=&quot;100&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Since I am using CUDA 12.1, I will uninstall nvidia-nccl-cu11.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1724312440861&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;pip uninstall nvidia-nccl-cu11&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Run the code again and it works correctly!&lt;/span&gt;&lt;/p&gt;</description>
      <category>Error Note</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/169</guid>
      <comments>https://ga02-ailab.tistory.com/169#entry169comment</comments>
      <pubDate>Fri, 20 Sep 2024 17:18:55 +0900</pubDate>
    </item>
    <item>
      <title>[Pytorch] 메모리 효율적으로 사용하기</title>
      <link>https://ga02-ailab.tistory.com/168</link>
<description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;When working on deep learning tasks, you often need to run several models one after another. This post covers how to use memory more efficiently in PyTorch in that situation.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1) First, move the model you are about to use to the GPU.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1723512503850&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;model.to(device)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2) As soon as you are done using the model, move it back to the CPU.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1723512536572&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;model.to('cpu')&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3) Run garbage collection and empty the CUDA cache.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1723512791479&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import gc
import torch

gc.collect()
torch.cuda.empty_cache()
torch.cuda.ipc_collect()&lt;/code&gt;&lt;/pre&gt;
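&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The three steps above can be wrapped in a small helper so that a model is always returned to the CPU and caches are freed, even if an error occurs. This is a minimal sketch; the on_device name and structure are my own, not an official PyTorch API:&lt;/span&gt;&lt;/p&gt;

```python
import gc
from contextlib import contextmanager

@contextmanager
def on_device(model, device):
    """Move `model` to `device` for the duration of the block,
    then return it to the CPU and free cached GPU memory."""
    model.to(device)
    try:
        yield model
    finally:
        model.to("cpu")
        gc.collect()
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.ipc_collect()
        except ImportError:
            pass  # torch unavailable; nothing to free
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Usage: with on_device(model, 'cuda:0') as m: out = m(x) &amp;mdash; the cleanup runs even when the block raises.&lt;/span&gt;&lt;/p&gt;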
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
</description>
      <category>Pytorch</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/168</guid>
      <comments>https://ga02-ailab.tistory.com/168#entry168comment</comments>
      <pubDate>Fri, 30 Aug 2024 10:10:59 +0900</pubDate>
    </item>
    <item>
      <title>ValueError: Cannot load &amp;lt;class 'diffusers.models.controlnet.ControlNetModel'&amp;gt; from / because the following keys are missing: Please make sure to pass `low_cpu_mem_usage=False` and `device_map=None` if you want to randomly initialize those weights or else</title>
      <link>https://ga02-ailab.tistory.com/167</link>
<description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Full error message]&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1723511010079&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;ValueError: Cannot load &amp;lt;class 'diffusers.models.controlnet.ControlNetModel'&amp;gt; from / because the following keys are missing:
 Please make sure to pass `low_cpu_mem_usage=False` and `device_map=None` if you want to randomly initialize those weights or else make sure your checkpoint file is correct.&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This error occurs when loading a ControlNet model with the code below.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1723511163884&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;controlnet = ControlNetModel.from_pretrained('controlnet',
                                            torch_dtype = torch.float16,
                                            )&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;It can be fixed by adding two parameters: low_cpu_mem_usage and device_map.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1723511220432&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;controlnet = ControlNetModel.from_pretrained('controlnet',
                                            torch_dtype = torch.float16,
                                            low_cpu_mem_usage=False,
                                            device_map=None
                                            )&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
</description>
      <category>Error Note</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/167</guid>
      <comments>https://ga02-ailab.tistory.com/167#entry167comment</comments>
      <pubDate>Tue, 13 Aug 2024 10:07:57 +0900</pubDate>
    </item>
    <item>
      <title>[3] PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding</title>
      <link>https://ga02-ailab.tistory.com/166</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Paper] &lt;a href=&quot;https://openaccess.thecvf.com//content/CVPR2024/papers/Li_PhotoMaker_Customizing_Realistic_Human_Photos_via_Stacked_ID_Embedding_CVPR_2024_paper.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://openaccess.thecvf.com//content/CVPR2024/papers/Li_PhotoMaker_Customizing_Realistic_Human_Photos_via_Stacked_ID_Embedding_CVPR_2024_paper.pdf&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Github] &lt;a href=&quot;https://github.com/TencentARC/PhotoMaker&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://github.com/TencentARC/PhotoMaker&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1720144224446&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;GitHub - TencentARC/PhotoMaker: PhotoMaker&quot; data-og-description=&quot;PhotoMaker. Contribute to TencentARC/PhotoMaker development by creating an account on GitHub.&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/TencentARC/PhotoMaker&quot; data-og-url=&quot;https://github.com/TencentARC/PhotoMaker&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/bw1uL1/hyWvR53uE4/tEnfaTe9PSWyTu76PLNjy0/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600&quot;&gt;&lt;a href=&quot;https://github.com/TencentARC/PhotoMaker&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/TencentARC/PhotoMaker&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/bw1uL1/hyWvR53uE4/tEnfaTe9PSWyTu76PLNjy0/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GitHub - TencentARC/PhotoMaker: PhotoMaker&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;PhotoMaker. Contribute to TencentARC/PhotoMaker development by creating an account on GitHub.&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. Introduction&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Like FaceChain in the previous post, this paper proposes a way to preserve a person's identity when generating personalized images. Its approach is to stack the embeddings of the input images. &lt;span style=&quot;background-color: #ffffff; text-align: start;&quot;&gt;This preserves the information of diverse identities and captures it comprehensively.&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; text-align: start;&quot;&gt;The authors also report that it is far faster than other models such as DreamBooth, produces higher-quality results, and generalizes better. On top of that, it supports a variety of applications, including attribute change, stylization, and identity mixing.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2252&quot; data-origin-height=&quot;1544&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/U6an9/btsIn7CwPJB/ZCGF2ARrtZ2r02LB7Pzgs1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/U6an9/btsIn7CwPJB/ZCGF2ARrtZ2r02LB7Pzgs1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/U6an9/btsIn7CwPJB/ZCGF2ARrtZ2r02LB7Pzgs1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FU6an9%2FbtsIn7CwPJB%2FZCGF2ARrtZ2r02LB7Pzgs1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2252&quot; height=&quot;1544&quot; data-origin-width=&quot;2252&quot; data-origin-height=&quot;1544&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Method&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.1 overview&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2142&quot; data-origin-height=&quot;1028&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/d6lUWT/btsIoqIytwQ/6LcUNTNRrHMhumCkkPEJC1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/d6lUWT/btsIoqIytwQ/6LcUNTNRrHMhumCkkPEJC1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/d6lUWT/btsIoqIytwQ/6LcUNTNRrHMhumCkkPEJC1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fd6lUWT%2FbtsIoqIytwQ%2F6LcUNTNRrHMhumCkkPEJC1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2142&quot; height=&quot;1028&quot; data-origin-width=&quot;2142&quot; data-origin-height=&quot;1028&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.2 Stacked ID Embedding&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.2.1 Encoders&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A CLIP image encoder is used. Everything outside the person-related (body) region is filled with noise to minimize the background's influence. Because CLIP was trained mainly on natural images, some of the transformer layers are fine-tuned so that embeddings can be extracted well from the masked images.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.2.2 Stacking&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A class word such as &quot;man&quot; or &quot;woman&quot; is located in the input caption and the feature vector at that position is extracted. This vector is fused with the image embeddings obtained earlier, using two MLP layers. The key difference from other models is that the input images are not collapsed into a single embedding vector; instead the stack is kept and forwarded through the model. This is what makes it possible to mix the identities of several people and generate a new person.&lt;/span&gt;&lt;/p&gt;
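&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A toy sketch of this stacking idea (the concatenation-based fusion and the identity &quot;MLP&quot; here are my simplifications, not the paper's exact design):&lt;/span&gt;&lt;/p&gt;

```python
def fuse(class_word_feat, image_embeds, mlp):
    # Fuse the class-word text feature with EACH image embedding,
    # keeping one fused vector per input image (the "stacked ID embedding")
    # instead of averaging everything into a single vector.
    return [mlp(class_word_feat + e) for e in image_embeds]

# Toy 2-D features with an identity "MLP"; list + is concatenation here.
mlp = lambda v: v
stacked = fuse([0.1, 0.2], [[1.0, 1.0], [2.0, 2.0]], mlp)
# One fused vector per input image, so identity mixing simply means
# placing different people's embeddings into the same stack.
```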
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.2.3 Merging&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;The cross-attention mechanism of the diffusion model is used to merge the information contained in the stacked ID embedding.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;954&quot; data-origin-height=&quot;158&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/oWftD/btsIpwPfsOf/U149Xe6rykcF3P1VzramC0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/oWftD/btsIpwPfsOf/U149Xe6rykcF3P1VzramC0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/oWftD/btsIpwPfsOf/U149Xe6rykcF3P1VzramC0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FoWftD%2FbtsIpwPfsOf%2FU149Xe6rykcF3P1VzramC0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;406&quot; height=&quot;67&quot; data-origin-width=&quot;954&quot; data-origin-height=&quot;158&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In addition, prompt weighting can adjust how much each image contributes, and, as in other models, LoRA is used to train the attention layers.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.3 ID-Oriented Human Data Construction&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This section introduces the pipeline used to build a human-centric text-image dataset.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.3.1 Image downloading&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Images of the identities covered by VGGFace2 were crawled from the web, 100 per person. To obtain high-quality data, only images with a resolution of at least 512 were collected.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.3.2 Face detection and filtering&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Face regions are detected with RetinaNet, and images that contain no face, or whose faces are smaller than 256 &amp;times; 256, are filtered out.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.3.3 ID verification&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Since an input image may contain several people's faces, the faces are embedded with ArcFace and then grouped by identity.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.3.4 Cropping and segmentation&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Each image is cropped slightly larger than the face region, and the crop is checked to ensure the face occupies at least 10% of it. Since the background must be removed before forwarding to the image encoder, it is masked out with Mask2Former.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.3.5 Captioning and marking&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Captions are extracted from the cropped images with BLIP2, repeating until a class word appears in the caption.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3. Experiments&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1902&quot; data-origin-height=&quot;1530&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/uYfQB/btsIoxO2LW3/q1U39o7SyxDFoC2kPfMZK1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/uYfQB/btsIoxO2LW3/q1U39o7SyxDFoC2kPfMZK1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/uYfQB/btsIoxO2LW3/q1U39o7SyxDFoC2kPfMZK1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FuYfQB%2FbtsIoxO2LW3%2Fq1U39o7SyxDFoC2kPfMZK1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;697&quot; height=&quot;561&quot; data-origin-width=&quot;1902&quot; data-origin-height=&quot;1530&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2220&quot; data-origin-height=&quot;704&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/diZZaG/btsIpxOeUyK/nTdl4FiUFQ7Y4ixJbQPSG0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/diZZaG/btsIpxOeUyK/nTdl4FiUFQ7Y4ixJbQPSG0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/diZZaG/btsIpxOeUyK/nTdl4FiUFQ7Y4ixJbQPSG0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdiZZaG%2FbtsIpxOeUyK%2FnTdl4FiUFQ7Y4ixJbQPSG0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;724&quot; height=&quot;230&quot; data-origin-width=&quot;2220&quot; data-origin-height=&quot;704&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1048&quot; data-origin-height=&quot;862&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/DyV7L/btsIn38YIr9/kBoM03sgZxCK7X5paQFwN1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/DyV7L/btsIn38YIr9/kBoM03sgZxCK7X5paQFwN1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/DyV7L/btsIn38YIr9/kBoM03sgZxCK7X5paQFwN1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FDyV7L%2FbtsIn38YIr9%2FkBoM03sgZxCK7X5paQFwN1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;704&quot; height=&quot;579&quot; data-origin-width=&quot;1048&quot; data-origin-height=&quot;862&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;You can try it right away at the link below, without running any code.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;a href=&quot;https://huggingface.co/spaces/TencentARC/PhotoMaker&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://huggingface.co/spaces/TencentARC/PhotoMaker&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1720159783109&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;PhotoMaker - a Hugging Face Space by TencentARC&quot; data-og-description=&quot;Running on Zero&quot; data-og-host=&quot;huggingface.co&quot; data-og-source-url=&quot;https://huggingface.co/spaces/TencentARC/PhotoMaker&quot; data-og-url=&quot;https://huggingface.co/spaces/TencentARC/PhotoMaker&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/b9ESxO/hyWvV1LysW/xE7gZZKnDXhxmfuN2Gpye1/img.png?width=1200&amp;amp;height=648&amp;amp;face=0_0_1200_648&quot;&gt;&lt;a href=&quot;https://huggingface.co/spaces/TencentARC/PhotoMaker&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://huggingface.co/spaces/TencentARC/PhotoMaker&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/b9ESxO/hyWvV1LysW/xE7gZZKnDXhxmfuN2Gpye1/img.png?width=1200&amp;amp;height=648&amp;amp;face=0_0_1200_648');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;PhotoMaker - a Hugging Face Space by TencentARC&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Running on Zero&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;huggingface.co&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper Review/Diffusion Personalization</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/166</guid>
      <comments>https://ga02-ailab.tistory.com/166#entry166comment</comments>
      <pubDate>Tue, 30 Jul 2024 10:44:47 +0900</pubDate>
    </item>
    <item>
      <title>[Docker] Docker 권한 문제 해결하기</title>
      <link>https://ga02-ailab.tistory.com/165</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;When running docker commands, you may often see a permission-denied error like the one below.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1720678890624&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get &quot;http://%2Fvar%2Frun%2Fdocker.sock/v1.45/containers/json&quot;: dial unix /var/run/docker.sock: connect: permission denied&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;You can fix this by adding the user to the docker group.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Log in with an administrator account and run the two steps below.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. First, create the docker group. (Skip this step if the group already exists.)&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1720678963792&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;sudo groupadd docker&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Add the user to the group.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1720679003677&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;sudo usermod -aG docker $USERID&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Log back in as that user and the problem is solved! (The new group membership only takes effect on a fresh login.)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Docker</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/165</guid>
      <comments>https://ga02-ailab.tistory.com/165#entry165comment</comments>
      <pubDate>Thu, 11 Jul 2024 15:24:41 +0900</pubDate>
    </item>
    <item>
      <title>[2] FaceChain: A Playground for Human-centric Artificial Intelligence Generated Content</title>
      <link>https://ga02-ailab.tistory.com/164</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Paper]&lt;/span&gt; &lt;a href=&quot;https://arxiv.org/pdf/2308.14256v2&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/pdf/2308.14256v2&lt;/a&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;[Github] &lt;/span&gt;&lt;a href=&quot;https://github.com/modelscope/facechain&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://github.com/modelscope/facechain&lt;/a&gt; &lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;(check out the v3.0.0 tag.)&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1720058037010&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;GitHub - modelscope/facechain: FaceChain is a deep-learning toolchain for generating your Digital-Twin.&quot; data-og-description=&quot;FaceChain is a deep-learning toolchain for generating your Digital-Twin. - modelscope/facechain&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/modelscope/facechain&quot; data-og-url=&quot;https://github.com/modelscope/facechain&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/bvNCY9/hyWrTqHzYR/NPehQLGHUfrn8vI2ZKXgh1/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600&quot;&gt;&lt;a href=&quot;https://github.com/modelscope/facechain&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/modelscope/facechain&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/bvNCY9/hyWrTqHzYR/NPehQLGHUfrn8vI2ZKXgh1/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GitHub - modelscope/facechain: FaceChain is a deep-learning toolchain for generating your Digital-Twin.&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;FaceChain is a deep-learning toolchain for generating your Digital-Twin. - modelscope/facechain&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;i&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Abstract&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1314&quot; data-origin-height=&quot;886&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/chTJAK/btsImLlESjE/DCp6tqc4TcgjBcufR3HHj0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/chTJAK/btsImLlESjE/DCp6tqc4TcgjBcufR3HHj0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/chTJAK/btsImLlESjE/DCp6tqc4TcgjBcufR3HHj0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FchTJAK%2FbtsImLlESjE%2FDCp6tqc4TcgjBcufR3HHj0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;556&quot; height=&quot;375&quot; data-origin-width=&quot;1314&quot; data-origin-height=&quot;886&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Personalized image generation has recently been drawing a great deal of attention, and many text-to-image models that learn a person's identity from several images of that person have been released. Most of these methods, however, share two major problems.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;-&amp;nbsp; The generated portraits fail to capture the face shape and facial features of the input images; even for different people, the outputs share the same characteristic look.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;-&amp;nbsp; The generated faces are imperfect: distorted, blurry, and so on.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;To address these problems, this paper proposes &lt;b&gt;FaceChain&lt;/b&gt;, a personalized portrait generation framework that combines face-related perceptual understanding models (face detection, deep face embedding extraction, and facial attribute recognition).&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. Introduction&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;As noted above, for human-centric content generation, pre-trained text-to-image models have clear limitations in preserving the identity of the input faces. To address this, existing work follows a process of first learning the identity from face images and then generating images according to a text prompt, using either LoRA (Low-Rank Adaptation) or identifier-based methods that learn identity information. Both approaches suffer from the problems described in the Abstract.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This paper therefore introduces FaceChain, which preserves facial features while keeping the style controllable. By using two LoRA models, it can integrate a personalized style with a personal identity. Since it can also be applied in many directions, such as talking heads,&lt;span style=&quot;text-align: start;&quot;&gt;&amp;nbsp;the authors expect it to contribute broadly to the field of personalized image generation.&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Architecture&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The model is built on Stable Diffusion, and the personalized portrait generation process is designed as an encapsulated pipeline. A brief overview of the architecture follows.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;To improve style stability and ID consistency, two LoRA models are used (style-LoRA + face-LoRA). They are trained separately: style-LoRA offline and face-LoRA online. To keep the input images consistent, their size, skin quality, and orientation are also normalized. At the inference stage, the LoRA weights are merged into the diffusion model to generate images. The figure below shows the overall architecture of FaceChain.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1496&quot; data-origin-height=&quot;872&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/buDM45/btsIlAFfQVQ/25jKyPvmaMdArxIkK9ZuQk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/buDM45/btsIlAFfQVQ/25jKyPvmaMdArxIkK9ZuQk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/buDM45/btsIlAFfQVQ/25jKyPvmaMdArxIkK9ZuQk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbuDM45%2FbtsIlAFfQVQ%2F25jKyPvmaMdArxIkK9ZuQk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1496&quot; height=&quot;872&quot; data-origin-width=&quot;1496&quot; data-origin-height=&quot;872&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #333333;&quot;&gt;2.1 Data Processing&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.1.1 Face Extraction&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Several processing steps are applied to obtain a standardized set of face images from the input data. The steps are as follows.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;1) To detect rotated faces:&lt;/span&gt; &lt;a href=&quot;https://modelscope.cn/models/Cherrytest/rot_bgr&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Image Rotation&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;2) Using face landmarks:&lt;/span&gt; &lt;a href=&quot;https://modelscope.cn/models/iic/cv_ddsar_face-detection_iclr23-damofd&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Face Rotation&amp;nbsp;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3) The face region is cropped and centered; then, for human parsing, a Masked-attention Mask Transformer model generates and segments a mask over the head region (&lt;a href=&quot;https://modelscope.cn/models/iic/cv_resnet101_image-multiple-human-parsing&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Face Region Crop and Segmentation&lt;/a&gt;)&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;4) To improve the skin quality of the input images: &lt;a href=&quot;https://modelscope.cn/models/iic/cv_unet_skin-retouching&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Skin Retouching&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
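&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The four extraction steps above can be sketched as a simple sequential pipeline. This is a minimal illustration only; the step functions below are hypothetical stand-ins, not the actual APIs of the linked ModelScope models.&lt;/span&gt;&lt;/p&gt;

```python
# Minimal sketch of the face-extraction pipeline described above.
# Each step function here is a hypothetical stand-in for a ModelScope model.
def preprocess_faces(images, steps):
    """Run every input image through the ordered preprocessing steps."""
    for step in steps:
        images = [step(img) for img in images]
    return images

# Illustrative steps: each one just records that it ran by adding a flag.
steps = [
    lambda img: dict(img, rotated=True),    # 1) image rotation
    lambda img: dict(img, aligned=True),    # 2) landmark-based face rotation
    lambda img: dict(img, mask="head"),     # 3) crop + head segmentation
    lambda img: dict(img, retouched=True),  # 4) skin retouching
]
processed = preprocess_faces([{"path": "face01.png"}], steps)
```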
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.1.2 Label Tagging&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;To use a text-to-image approach, each input image must be tagged. Points to keep in mind when generating tags are listed below.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Features that appear only in specific images, such as facial expressions and accessories, must be labeled accurately to maintain the correct image-tag correspondence.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Tags related to identity, such as eyes, lips, and ears, do not necessarily need to be used.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Tagging the person's overall characteristics, such as &quot;man&quot;, can be more effective.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In this paper, tags are first obtained with the text annotation model &lt;a href=&quot;https://github.com/KichangKim/DeepDanbooru&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;DeepDanbooru&lt;/a&gt;, and then gender and age scores obtained with &lt;a href=&quot;https://modelscope.cn/models/iic/cv_resnet34_face-attribute-recognition_fairface&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;FairFace&lt;/a&gt; are used for post-processing. As a result, each image is assigned one of the six tags (trigger words) in the table below.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;852&quot; data-origin-height=&quot;218&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/5XRxn/btsIlVbsOGf/fKtOnQpK3q1PMcq1MsLrw1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/5XRxn/btsIlVbsOGf/fKtOnQpK3q1PMcq1MsLrw1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/5XRxn/btsIlVbsOGf/fKtOnQpK3q1PMcq1MsLrw1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F5XRxn%2FbtsIlVbsOGf%2FfKtOnQpK3q1PMcq1MsLrw1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;594&quot; height=&quot;152&quot; data-origin-width=&quot;852&quot; data-origin-height=&quot;218&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
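&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The trigger-word assignment can be sketched as a small lookup over the FairFace predictions. The age cut-offs and exact word choices below are illustrative assumptions; the paper's table above is the source of truth for the six classes.&lt;/span&gt;&lt;/p&gt;

```python
import bisect

# Hypothetical mapping from FairFace-style gender/age outputs to one of six
# coarse trigger words. The cut-offs and wordings here are assumptions made
# for illustration, not values taken from the paper's table.
AGE_BOUNDS = [18, 60]  # illustrative age cut-offs
TRIGGER_WORDS = {
    "male": ["a boy", "a man", "an old man"],
    "female": ["a girl", "a woman", "an old woman"],
}

def trigger_word(gender, age):
    """Pick the trigger word for a predicted gender and age."""
    band = bisect.bisect_right(AGE_BOUNDS, age)  # age band index: 0, 1, or 2
    return TRIGGER_WORDS[gender][band]
```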
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.2 Model Training&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The hyper-parameters reported in the paper are as follows.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- LoRA rank = 32&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- learning rate = 1e-4&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Epoch = 20&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- optimizer = AdamW&lt;/span&gt;&lt;/p&gt;
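&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;As a rough sense of scale for the rank-32 setting: LoRA adds rank * (d_in + d_out) trainable parameters per adapted linear layer. This is a generic LoRA fact, not a number from the paper, and the config key names and the 768-dimensional example layer below are illustrative, not taken from the FaceChain codebase.&lt;/span&gt;&lt;/p&gt;

```python
# The reported training settings, collected into one (hypothetical) config dict.
lora_training_config = {
    "lora_rank": 32,
    "learning_rate": 1e-4,
    "epochs": 20,
    "optimizer": "AdamW",
}

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one linear layer: a (rank x d_in)
    down-projection plus a (d_out x rank) up-projection."""
    return rank * (d_in + d_out)

# Example: a 768x768 attention projection with the paper's rank of 32.
extra_params = lora_param_count(768, 768, lora_training_config["lora_rank"])
```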
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.3 Model Inference&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1634&quot; data-origin-height=&quot;658&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b6gUcI/btsImidfQ8Q/6YNGpDnNkt8BynuQxNIJP0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b6gUcI/btsImidfQ8Q/6YNGpDnNkt8BynuQxNIJP0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b6gUcI/btsImidfQ8Q/6YNGpDnNkt8BynuQxNIJP0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb6gUcI%2FbtsImidfQ8Q%2F6YNGpDnNkt8BynuQxNIJP0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1634&quot; height=&quot;658&quot; data-origin-width=&quot;1634&quot; data-origin-height=&quot;658&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt;At the inference stage,&amp;nbsp;&lt;/span&gt;the face-LoRA and style-LoRA models are merged into Stable Diffusion. Each LoRA model can be given its own weight; the paper sets them to 0.25 and 1.0, respectively. The Stable Diffusion model then generates images according to the input prompt. To improve generation quality, post-processing steps such as Template Face Selection, Face Fusion, and Similarity Ranking are also applied.&lt;/span&gt;&lt;/p&gt;
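&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The weighted merge can be sketched in plain NumPy as W = W0 + sum_i scale_i * (up_i @ down_i). This is a toy illustration of how two scaled LoRA updates fold into one frozen weight matrix, not FaceChain's actual merging code; the matrix sizes are arbitrary toy values.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def merge_loras(base_weight, loras, scales):
    """Fold weighted LoRA updates into a frozen base weight:
    W = W0 + sum_i scale_i * (up_i @ down_i)."""
    merged = base_weight.copy()
    for (down, up), scale in zip(loras, scales):
        merged = merged + scale * (up @ down)
    return merged

rng = np.random.default_rng(0)
dim, rank = 8, 4  # toy sizes; real attention projections are much larger
W0 = rng.normal(size=(dim, dim))
face_lora = (rng.normal(size=(rank, dim)), rng.normal(size=(dim, rank)))
style_lora = (rng.normal(size=(rank, dim)), rng.normal(size=(dim, rank)))
# The paper's inference weights: face-LoRA 0.25, style-LoRA 1.0.
W = merge_loras(W0, [face_lora, style_lora], [0.25, 1.0])
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Setting a LoRA's scale to 0 recovers the base model, which is what makes the two adapters independently controllable.&lt;/span&gt;&lt;/p&gt;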
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.4 Model Post Processing&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Template Face Selection: a quality score is computed for each input image using the &lt;a href=&quot;https://www.modelscope.cn/models/iic/cv_manual_face-quality-assessment_fqa&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Face Quality Assessment (FQA)&lt;/a&gt; model. The face with the highest score becomes the template image.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Face Fusion: the &lt;a href=&quot;https://www.modelscope.cn/models/iic/cv_unet-image-face-fusion_damo&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Face Fusion model&lt;/a&gt; is used to fuse the face in each generated image with the template face.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- &lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;Similarity Ranking: finally, the similarity between each generated image and the template image is computed. To account for the inherent statistical gap between the generated and input images, facial similarity is computed with the &lt;a href=&quot;https://www.modelscope.cn/models/iic/cv_ir_face-recognition-ood_rts&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Random Temperature Scaling&lt;/a&gt; model. &lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;The image with the highest similarity is returned as the final output.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
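&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The final ranking step can be sketched as cosine-similarity ranking over face embeddings. This is a simplified stand-in: the paper scores similarity with the Random Temperature Scaling model, not plain cosine similarity, and the 2-D embeddings below are toy values.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def rank_by_similarity(template_emb, candidate_embs):
    """Rank candidate face embeddings by cosine similarity to the template
    embedding, best first. A simplified stand-in for the RTS-based scoring."""
    t = template_emb / np.linalg.norm(template_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    sims = c @ t                  # cosine similarity of each candidate
    order = np.argsort(-sims)     # indices sorted by descending similarity
    return order, sims

# Toy 2-D embeddings: candidate 1 points almost exactly at the template.
template = np.array([1.0, 0.0])
candidates = np.array([[0.6, 0.8], [1.0, 0.1], [0.0, 1.0]])
order, sims = rank_by_similarity(template, candidates)
```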
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;3. Result&amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1280&quot; data-origin-height=&quot;943&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/oXpmf/btsInVufsYU/BmfbvYxQpqyD7clPc3CO3k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/oXpmf/btsInVufsYU/BmfbvYxQpqyD7clPc3CO3k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/oXpmf/btsInVufsYU/BmfbvYxQpqyD7clPc3CO3k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FoXpmf%2FbtsInVufsYU%2FBmfbvYxQpqyD7clPc3CO3k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;522&quot; height=&quot;385&quot; data-origin-width=&quot;1280&quot; data-origin-height=&quot;943&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;You can test it right away on the site below, without running any code.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;a href=&quot;https://huggingface.co/spaces/modelscope/FaceChain&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://huggingface.co/spaces/modelscope/FaceChain&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1720064498799&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;FaceChain - a Hugging Face Space by modelscope&quot; data-og-description=&quot;&quot; data-og-host=&quot;huggingface.co&quot; data-og-source-url=&quot;https://huggingface.co/spaces/modelscope/FaceChain&quot; data-og-url=&quot;https://huggingface.co/spaces/modelscope/FaceChain&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/dz6wvO/hyWvPNG9dd/8ErDTX99nl7gG7A1YIiCj0/img.png?width=1200&amp;amp;height=648&amp;amp;face=0_0_1200_648&quot;&gt;&lt;a href=&quot;https://huggingface.co/spaces/modelscope/FaceChain&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://huggingface.co/spaces/modelscope/FaceChain&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/dz6wvO/hyWvPNG9dd/8ErDTX99nl7gG7A1YIiCj0/img.png?width=1200&amp;amp;height=648&amp;amp;face=0_0_1200_648');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;FaceChain - a Hugging Face Space by modelscope&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;huggingface.co&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper Review/Diffusion Personalization</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/164</guid>
      <comments>https://ga02-ailab.tistory.com/164#entry164comment</comments>
      <pubDate>Thu, 4 Jul 2024 12:45:00 +0900</pubDate>
    </item>
    <item>
      <title>[1] Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control</title>
      <link>https://ga02-ailab.tistory.com/163</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Paper]&lt;/span&gt; &lt;a href=&quot;https://arxiv.org/pdf/2405.12970&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/pdf/2405.12970&lt;/a&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Github] &lt;a href=&quot;https://github.com/FaceAdapter/Face-Adapter&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://github.com/FaceAdapter/Face-Adapter&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1718331645640&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;GitHub - FaceAdapter/Face-Adapter&quot; data-og-description=&quot;Contribute to FaceAdapter/Face-Adapter development by creating an account on GitHub.&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/FaceAdapter/Face-Adapter&quot; data-og-url=&quot;https://github.com/FaceAdapter/Face-Adapter&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/cJwBOk/hyWlgL4Juo/126OTcpoU726sHCtjxMIU1/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600&quot;&gt;&lt;a href=&quot;https://github.com/FaceAdapter/Face-Adapter&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/FaceAdapter/Face-Adapter&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/cJwBOk/hyWlgL4Juo/126OTcpoU726sHCtjxMIU1/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GitHub - FaceAdapter/Face-Adapter&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Contribute to FaceAdapter/Face-Adapter development by creating an account on GitHub.&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;i&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;1. Introduction&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/i&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Conventional face reenactment and face swapping have mostly relied on GAN models. Recently, diffusion models have been replacing GANs, but diffusion still suffers from several problems:&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Training is difficult.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Blurry results under large pose changes, caused by the lack of background information during training&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- A focus on text-driven attribute control =&amp;gt; weak spatial control over the image, with heavy constraints on controlling the face and pose&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;To address these problems, &lt;b&gt;this paper proposes Face-Adapter, which achieves high precision and high fidelity on top of a pre-trained diffusion model.&lt;/b&gt; Face-Adapter consists of:&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- a &lt;b&gt;Spatial Condition Generator&lt;/b&gt; that provides precise landmarks and the background&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- a &lt;b&gt;plug-and-play Identity Encoder&lt;/b&gt; that maps the face embedding into the text space via a transformer decoder&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- an &lt;b&gt;Attribute Controller&lt;/b&gt; that integrates the spatial condition with fine-grained attributes&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The following sections describe each component in detail!&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;i&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Method&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2256&quot; data-origin-height=&quot;1000&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/mEOxX/btsHUTLZyul/PbXUjN4GHF3NAuWfdaqxZ1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/mEOxX/btsHUTLZyul/PbXUjN4GHF3NAuWfdaqxZ1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/mEOxX/btsHUTLZyul/PbXUjN4GHF3NAuWfdaqxZ1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FmEOxX%2FbtsHUTLZyul%2FPbXUjN4GHF3NAuWfdaqxZ1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2256&quot; height=&quot;1000&quot; data-origin-width=&quot;2256&quot; data-origin-height=&quot;1000&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;The proposed Face-Adapter aims to integrate the source identity into a template image, following the target image's motion (pose, expression, and gaze) while preserving attributes such as lighting, background, and hair.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Before the details, here is how this paper defines the source and target for face reenactment and swapping.&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;face reenactment: change the source image using only the pose and expression taken from the target image (facial features, background, and hair are preserved)&amp;nbsp;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;source =&amp;gt; the image providing the facial features, background, and hairstyle to keep / target =&amp;gt; the image providing the expression and pose&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;face swapping: replace only the face in the target image; the background, expression, pose, and hair stay identical to the target image&amp;nbsp;&lt;/span&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span style=&quot;color: #000000;&quot;&gt;source =&amp;gt; the image providing the facial features / target =&amp;gt; the template image&amp;nbsp;&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1674&quot; data-origin-height=&quot;312&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cbAgoy/btsHWCpbsS6/pORKmLVWcgXxer9ic4YNr1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cbAgoy/btsHWCpbsS6/pORKmLVWcgXxer9ic4YNr1/img.png&quot; data-alt=&quot;face reenactment와 swapping&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cbAgoy/btsHWCpbsS6/pORKmLVWcgXxer9ic4YNr1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcbAgoy%2FbtsHWCpbsS6%2FpORKmLVWcgXxer9ic4YNr1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;683&quot; height=&quot;127&quot; data-origin-width=&quot;1674&quot; data-origin-height=&quot;312&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;face reenactment와 swapping&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;2.1 Spatial Condition Generator&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;This module makes the subsequent controlled generation more plausible and accurate. It consists of two sub-modules.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&amp;nbsp;2.1.1&amp;nbsp; 3D Landmark Projector&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;To handle differences in face shape, a 3D facial reconstruction method is used. Identity and expression coefficients are extracted separately from the source and target images, along with the pose coefficients. The identity coefficients of the source are then recombined with the expression and pose coefficients of the target, a new 3D face is reconstructed, and it is projected to obtain the landmarks.&lt;/span&gt;&lt;/p&gt;
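&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;The recombination step can be sketched in a few lines. This is a toy illustration assuming 3DMM-style coefficient dicts; the function name and keys are mine, not the paper's code.&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Toy sketch of the landmark-recombination step, assuming 3DMM-style
# coefficient dicts. Names and numbers are illustrative.
def recombine_coefficients(source, target):
    # Keep identity (face shape) from the source image,
    # take expression and pose from the target image.
    return {
        'identity':   source['identity'],
        'expression': target['expression'],
        'pose':       target['pose'],
    }

src = {'identity': [0.8, 0.1], 'expression': [0.0, 0.0], 'pose': [0.0]}
tgt = {'identity': [0.2, 0.9], 'expression': [0.5, 0.3], 'pose': [1.2]}

new_face = recombine_coefficients(src, tgt)
# new_face would then go through 3D reconstruction and projection
# to obtain the 2D landmarks used as the spatial condition.&lt;/code&gt;&lt;/pre&gt;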
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;2.1.2&amp;nbsp; Adapting Area Prediction&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;As the figure below shows, the background changes from moment to moment.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2208&quot; data-origin-height=&quot;422&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/oCPaY/btsHVICYOHi/EvE46jRRauSzuy5Pfd7ua1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/oCPaY/btsHVICYOHi/EvE46jRRauSzuy5Pfd7ua1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/oCPaY/btsHVICYOHi/EvE46jRRauSzuy5Pfd7ua1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FoCPaY%2FbtsHVICYOHi%2FEvE46jRRauSzuy5Pfd7ua1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2208&quot; height=&quot;422&quot; data-origin-width=&quot;2208&quot; data-origin-height=&quot;422&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;If the model lacks information about these background changes, the generated images end up with blurry backgrounds.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;For face swapping, providing the target background supplies the model with environmental lighting and spatial references. Adding this background constraint makes training easier and reduces the task to conditional inpainting. With this approach, the model can maintain background consistency and generate images that blend seamlessly with the background.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #c1bef9;&quot;&gt;2.2 Identity Encoder&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;As argued in IP-Adapter-FaceID and InstantID, high-level face embeddings can preserve identity more strongly. This paper argues that face reenactment requires neither a heavy texture encoder nor an additional identity network.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Concretely, given a face image I^S, the face embedding f_id is obtained from a pre-trained face recognition model E_id. A 3-layer transformer decoder ϕ_dec then projects the face embedding into the text semantic space to obtain the identity tokens. Thanks to this design, the pre-trained diffusion U-Net does not need to be fine-tuned to accept the face embedding.&lt;/span&gt;&lt;/p&gt;
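&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The identity-token idea can be sketched as follows; a single linear map stands in for the frozen face recognizer plus the 3-layer transformer decoder, so every name and number here is illustrative, not the paper's code.&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Toy sketch: a face embedding is mapped to a few tokens living in the
# text-conditioning space. A plain linear map replaces the real decoder.
def project_identity(face_embedding, weights):
    # weights: one row of coefficients per output identity token
    tokens = []
    for row in weights:
        tokens.append(sum(w * x for w, x in zip(row, face_embedding)))
    return tokens

f_id = [0.5, 1.0, 0.25, 0.0]          # stand-in for the face embedding f_id
W = [[1, 0, 0, 0], [0, 1, 1, 0]]      # stand-in for learned decoder weights
identity_tokens = project_identity(f_id, W)
# identity_tokens are fed to cross-attention the same way text tokens are.&lt;/code&gt;&lt;/pre&gt;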
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;2.3 Attribute Controller&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;2.3.1 Spatial Control&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Following ControlNet, a copy of the U-Net &amp;phi;_Ctl is created and the spatial control I_Sp is added as a conditioning input. The spatial control images are obtained by combining the target motion landmarks I^T_lmk with the non-adapting area predicted by the Adapting Area Predictor &amp;phi;_Re.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1520&quot; data-origin-height=&quot;262&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/DGxTa/btsHU6ZbjKy/WN4pmSF8kmYZ0JSfzmTETK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/DGxTa/btsHU6ZbjKy/WN4pmSF8kmYZ0JSfzmTETK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/DGxTa/btsHU6ZbjKy/WN4pmSF8kmYZ0JSfzmTETK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FDGxTa%2FbtsHU6ZbjKy%2FWN4pmSF8kmYZ0JSfzmTETK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;477&quot; height=&quot;82&quot; data-origin-width=&quot;1520&quot; data-origin-height=&quot;262&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;2.3.2 Attribute Template&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Given the identity and the spatial control, the attribute template is designed to fill in the missing information, including lighting, background, and hair. The attribute embeddings f_attr are extracted from the attribute template with CLIP E_clip. Both the patch tokens and the global token are used in order to capture both local and global features.&lt;/span&gt;&lt;/p&gt;
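&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Keeping both the global token and the patch tokens can be sketched as a simple concatenation; the shapes and names below are illustrative, not the paper's code.&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Toy sketch of assembling f_attr: the CLIP global token and all patch
# tokens are kept in one sequence, so both global and local information
# reach the cross-attention.
def build_attribute_tokens(global_token, patch_tokens):
    # one sequence: [global, patch_1, ..., patch_N]
    return [global_token] + patch_tokens

g = [0.1, 0.2]                        # stand-in for the CLIP global token
patches = [[0.3, 0.1], [0.0, 0.5]]    # stand-ins for CLIP patch tokens
f_attr = build_attribute_tokens(g, patches)   # length 1 + N&lt;/code&gt;&lt;/pre&gt;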
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;2.4 Strategies for Boosting Performance&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;2.4.1 Training&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;1) Data Stream: for both face reenactment and face swapping, two images of the same person in different poses are used as the source and target. So that a single model can handle both tasks, the data stream for one of the two tasks is selected with 50% probability. The spatial control and attribute template of the Attribute Controller use the data streams marked in red and blue, respectively, in the architecture diagram above.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;2) Condition Dropping for Classifier-free Guidance: the dropped conditions are the identity tokens and attribute tokens fed into the cross-attention of the U-Net and ControlNet. Dropping happens with 5% probability.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
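&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;The two training strategies can be sketched with the stated probabilities (50% task selection, 5% condition dropping); the function and variable names are illustrative, not the paper's code.&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import random

# Sketch of one training step under the stated probabilities.
def sample_training_step(identity_tokens, attribute_tokens):
    # 1) Data stream: one model, two tasks, picked uniformly at random
    task = random.choice(['reenactment', 'swapping'])

    # 2) Condition dropping for classifier-free guidance: with 5 percent
    #    probability, drop the tokens fed to cross-attention
    if random.choices([True, False], weights=[5, 95])[0]:
        identity_tokens, attribute_tokens = None, None

    return task, identity_tokens, attribute_tokens&lt;/code&gt;&lt;/pre&gt;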
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;2.4.2 Inference&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;1) Adapting Area Predictor: for face reenactment, the inputs are the source image and the modified landmarks, and the output is the adapting area. For swapping, the input is the target image and the output is the adapting area.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;2) Negative Prompt for Classifier-Free Guidance: for face reenactment, nothing is fed into the negative prompt. For swapping, the identity tokens of the target image are used as the negative prompt to reduce the negative influence of the target identity.&lt;/span&gt;&lt;/p&gt;
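&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;The guidance step can be sketched element-wise over toy score vectors; for swapping, the prediction conditioned on the target identity tokens plays the role of the negative branch. All names and numbers are illustrative.&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;# Element-wise sketch of classifier-free guidance with a negative prompt.
def cfg(negative_scores, positive_scores, scale):
    return [n + scale * (p - n) for n, p in zip(negative_scores, positive_scores)]

pos = [1.0, 0.5]    # toy prediction with the source identity tokens
neg = [0.2, 0.1]    # toy prediction with the negative (target identity) tokens
guided = cfg(neg, pos, scale=3.0)&lt;/code&gt;&lt;/pre&gt;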
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;i&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;3. Experiments&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;i&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;3.1 Cross-identity face reenactment results on Voxceleb2&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/i&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2186&quot; data-origin-height=&quot;1372&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dFIviW/btsHW3zPc4u/tlGZneUYxQFRRAK6LFxY0k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dFIviW/btsHW3zPc4u/tlGZneUYxQFRRAK6LFxY0k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dFIviW/btsHW3zPc4u/tlGZneUYxQFRRAK6LFxY0k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdFIviW%2FbtsHW3zPc4u%2FtlGZneUYxQFRRAK6LFxY0k%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;755&quot; height=&quot;474&quot; data-origin-width=&quot;2186&quot; data-origin-height=&quot;1372&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR'; background-color: #c1bef9;&quot;&gt;3.2 Face swapping qualitative comparison results on Voxceleb2 test set&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2332&quot; data-origin-height=&quot;1526&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cXnrFh/btsHWB4Q2Ko/RpzaKf3Kyrws0KmUasWer0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cXnrFh/btsHWB4Q2Ko/RpzaKf3Kyrws0KmUasWer0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cXnrFh/btsHWB4Q2Ko/RpzaKf3Kyrws0KmUasWer0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcXnrFh%2FbtsHWB4Q2Ko%2FRpzaKf3Kyrws0KmUasWer0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;771&quot; height=&quot;505&quot; data-origin-width=&quot;2332&quot; data-origin-height=&quot;1526&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
      <category>Paper Review/Diffusion Personalization</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/163</guid>
      <comments>https://ga02-ailab.tistory.com/163#entry163comment</comments>
      <pubDate>Wed, 12 Jun 2024 15:39:59 +0900</pubDate>
    </item>
    <item>
      <title>[dlib] Error when installing dlib</title>
      <link>https://ga02-ailab.tistory.com/162</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;cmake must be installed before installing dlib.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;But if the following error message keeps appearing even after installing cmake...&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1715328126188&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt; ModuleNotFoundError: No module named 'cmake'&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1620&quot; data-origin-height=&quot;844&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/d56P5Q/btsHlyIqNgD/Audz4Kb0ufXOtnKk49kkZ1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/d56P5Q/btsHlyIqNgD/Audz4Kb0ufXOtnKk49kkZ1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/d56P5Q/btsHlyIqNgD/Audz4Kb0ufXOtnKk49kkZ1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fd56P5Q%2FbtsHlyIqNgD%2FAudz4Kb0ufXOtnKk49kkZ1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1620&quot; height=&quot;844&quot; data-origin-width=&quot;1620&quot; data-origin-height=&quot;844&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #f6e199;&quot;&gt;&amp;nbsp; [Solution]&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. First, remove every previously installed cmake.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1715328231379&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;sudo apt-get remove cmake
sudo apt-get purge cmake
sudo apt remove cmake&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Running &lt;b&gt;which cmake&lt;/b&gt; should print nothing.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Reinstall cmake and dlib with the versions below.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1715328305146&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;pip install cmake==3.25.2
pip install dlib==19.24.2&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;[Reference]&lt;/p&gt;
&lt;p data-ke-size=&quot;size14&quot;&gt;&lt;a href=&quot;https://github.com/davisking/dlib/issues/2943&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://github.com/davisking/dlib/issues/2943&lt;/a&gt;&lt;/p&gt;</description>
      <category>Error Note</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/162</guid>
      <comments>https://ga02-ailab.tistory.com/162#entry162comment</comments>
      <pubDate>Fri, 10 May 2024 17:06:20 +0900</pubDate>
    </item>
    <item>
      <title>TypeError: Unable to convert function return value to a Python type! The signature was () -&amp;gt; handle</title>
      <link>https://ga02-ailab.tistory.com/161</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Full error message&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1712040074640&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;TypeError: Unable to convert function return value to a Python type! The signature was () -&amp;gt; handle&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Solution&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1712040267110&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;pip3 install numpy --upgrade&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Error Note</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/161</guid>
      <comments>https://ga02-ailab.tistory.com/161#entry161comment</comments>
      <pubDate>Tue, 2 Apr 2024 15:44:41 +0900</pubDate>
    </item>
    <item>
      <title>AttributeError: cannot assign module before Module.__init__() call</title>
      <link>https://ga02-ailab.tistory.com/160</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #f6e199;&quot;&gt;- Error message&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1709111079292&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;AttributeError: cannot assign module before Module.__init__() call&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This error is caused by forgetting super().__init__() when defining a neural network model.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Adding it to the code as shown below fixes the problem.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1709111171982&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;class Block(nn.Module):
    def __init__(self, in_channel, hidden_channel, out_channel):
        super(Block, self).__init__()
        
        .
        .
        .&lt;/code&gt;&lt;/pre&gt;</description>
      <category>Error Note</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/160</guid>
      <comments>https://ga02-ailab.tistory.com/160#entry160comment</comments>
      <pubDate>Wed, 28 Feb 2024 18:06:31 +0900</pubDate>
    </item>
    <item>
      <title>[Docker] How to turn your own container into an image</title>
      <link>https://ga02-ailab.tistory.com/159</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. Using the commit command&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1709085202544&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;docker commit [your container name] [new repository]:[new tag]&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Using a Dockerfile&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Inside the Docker environment, creating a file named Dockerfile lets you record which base image your image was built from and which commands were run. The Dockerfile should contain something like the following.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1709085792473&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;FROM [pulled image]  ## means this image was pulled
RUN apt-get update &amp;amp;&amp;amp; apt-get install -y [library to install]  ## means this command was run


## example
FROM ubuntu ## the ubuntu image was pulled
RUN apt-get update &amp;amp;&amp;amp; apt-get install -y git&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;그리고 아래의 build 명령어를 실행해줍니다.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1709085871906&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;docker build -t [new repository]:[new tag] [path to the folder containing the Dockerfile]&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;As an aside, let's look at how to archive an image built this way into a file and load it back.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Archiving an image as a .tar file&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1709086108591&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;docker save -o [filename.tar] [image name or ID]&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #f6e199;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Loading a .tar file as an image&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1709086161690&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;docker load -i [tar filename]&lt;/code&gt;&lt;/pre&gt;
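&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Putting the commands above together, here is a minimal end-to-end sketch; the container and image names (my-container, myrepo:v1) are hypothetical placeholders.&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;## hypothetical names for illustration
docker commit my-container myrepo:v1    ## snapshot the container as an image
docker save -o myrepo_v1.tar myrepo:v1  ## archive the image to a .tar file
docker load -i myrepo_v1.tar            ## restore the image from the .tar file&lt;/code&gt;&lt;/pre&gt;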
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Docker</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/159</guid>
      <comments>https://ga02-ailab.tistory.com/159#entry159comment</comments>
      <pubDate>Wed, 28 Feb 2024 11:10:25 +0900</pubDate>
    </item>
    <item>
      <title>[OpenCV] Running a deep learning model with OpenCV dnn</title>
      <link>https://ga02-ailab.tistory.com/158</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In this post, we will look at how to run a deep learning model with the OpenCV dnn module.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;First, import the required libraries.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1707359569671&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import cv2
import numpy as np
from matplotlib import pyplot as plt&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Next, read the image to be used.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1707361470023&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;image = cv2.imread(&quot;test.jpg&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Here, I will run face detection on this image.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The model is the Caffe version of RetinaFace. To use it in OpenCV you additionally need a prototxt file, which you can think of as the file that defines the network's layers. To load a model in OpenCV from the model file and the prototxt file, do the following; since this is a Caffe model, I will use the cv2.dnn.readNetFromCaffe function.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1707361860202&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;detector = cv2.dnn.readNetFromCaffe('deploy.prototxt', 'Widerface-RetinaFace.caffemodel')&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Next, preprocess the input image with cv2.dnn.blobFromImage.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1707362345608&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;blob = cv2.dnn.blobFromImage(image, 1, mean=(104, 117, 123))&lt;/code&gt;&lt;/pre&gt;
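&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;To make the preprocessing concrete, here is a minimal NumPy-only sketch of what cv2.dnn.blobFromImage does with these arguments (scale factor 1, per-channel mean subtraction, no channel swap): it subtracts the mean and reorders the HxWx3 image into a 1x3xHxW blob. The 4x4 gray image is a hypothetical input.&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

def blob_from_image(image, scalefactor=1.0, mean=(104, 117, 123)):
    # image: HxWx3 array, channel order matching `mean`
    blob = (image.astype(np.float32) - np.float32(mean)) * scalefactor
    return blob.transpose(2, 0, 1)[np.newaxis, ...]  # HWC -&gt; NCHW

image = np.full((4, 4, 3), 128, dtype=np.uint8)  # hypothetical gray image
blob = blob_from_image(image)
print(blob.shape)        # (1, 3, 4, 4)
print(blob[0, 0, 0, 0])  # 24.0 = 128 - 104&lt;/code&gt;&lt;/pre&gt;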
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Then set the preprocessed blob as the network input and run inference with the model loaded above!&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1707362447538&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;detector.setInput(blob, 'data')
out = detector.forward('detection_out')&lt;/code&gt;&lt;/pre&gt;
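&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The forward pass returns a detection tensor. Assuming the common SSD-style (1, 1, N, 7) layout, where each row is [image_id, class_id, confidence, x_min, y_min, x_max, y_max] with coordinates normalized to [0, 1] (check your model's actual output format), the boxes can be filtered and scaled back to pixels as sketched below; the tensor values are made up for illustration.&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

# hypothetical detection output in SSD-style (1, 1, N, 7) layout
out = np.array([[[[0, 1, 0.98, 0.1, 0.2, 0.4, 0.6],
                  [0, 1, 0.30, 0.5, 0.5, 0.7, 0.9]]]])
h, w = 614, 1100  # original image size
boxes = []
for det in out[0, 0]:
    if det[2] &lt; 0.5:  # drop low-confidence detections
        continue
    x1, y1, x2, y2 = (det[3:7] * [w, h, w, h]).astype(int).tolist()
    boxes.append((float(det[2]), (x1, y1, x2, y2)))
print(boxes)  # [(0.98, (110, 122, 440, 368))]&lt;/code&gt;&lt;/pre&gt;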
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The detected face region is shown below!&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1100&quot; data-origin-height=&quot;614&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/uPieR/btsEB1k7gkM/DZy1H9dArLbBHTd6NE5KC1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/uPieR/btsEB1k7gkM/DZy1H9dArLbBHTd6NE5KC1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/uPieR/btsEB1k7gkM/DZy1H9dArLbBHTd6NE5KC1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FuPieR%2FbtsEB1k7gkM%2FDZy1H9dArLbBHTd6NE5KC1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;545&quot; height=&quot;304&quot; data-origin-width=&quot;1100&quot; data-origin-height=&quot;614&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>OpenCV</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/158</guid>
      <comments>https://ga02-ailab.tistory.com/158#entry158comment</comments>
      <pubDate>Thu, 8 Feb 2024 12:35:36 +0900</pubDate>
    </item>
    <item>
      <title>[OpenCV] Image warping with OpenCV</title>
      <link>https://ga02-ailab.tistory.com/157</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In this post, we will look at how to warp an image using OpenCV's getPerspectiveTransform and warpPerspective functions.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;First, import the required libraries.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1706065635031&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import cv2
import numpy as np&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Read the image to be warped.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1706065692214&quot; class=&quot;ini&quot; data-ke-type=&quot;codeblock&quot; data-ke-language=&quot;bash&quot;&gt;&lt;code&gt;image = cv2.imread('test.jpg')&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Set the source coordinates (pts holds the four corners, ordered to match the destination: top-left, top-right, bottom-right, bottom-left) and the destination coordinates of the w x h output rectangle; pts, w, and h must be defined beforehand.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1706066394828&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;src_pts = np.array([[pts[0][0], pts[0][1]], [pts[1][0], pts[1][1]], [pts[2][0], pts[2][1]], [pts[3][0], pts[3][1]]], dtype=np.float32)
dst_pts = np.array([[0, 0], [w, 0], [w, h], [0, h]], dtype=np.float32)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;Compute the transform matrix with the getPerspectiveTransform function.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1706066452766&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;trans_mat = cv2.getPerspectiveTransform(src_pts, dst_pts)&lt;/code&gt;&lt;/pre&gt;
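&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;For intuition, getPerspectiveTransform solves an 8x8 linear system for the homography that maps the four source corners onto the four destination corners. Below is a minimal NumPy sketch of that computation; the corner coordinates are hypothetical.&lt;/span&gt;&lt;/p&gt;
&lt;pre class=&quot;python&quot; data-ke-language=&quot;python&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;import numpy as np

def perspective_matrix(src, dst):
    # solve for the 3x3 homography H mapping [x, y, 1] onto [u, v, 1],
    # fixing H[2, 2] = 1 (the convention cv2.getPerspectiveTransform uses)
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.float64(A), np.float64(b))
    return np.append(h, 1.0).reshape(3, 3)

# hypothetical corners: a tilted quadrilateral mapped to a 100x100 square
src = [(10, 10), (90, 20), (80, 90), (15, 80)]
dst = [(0, 0), (100, 0), (100, 100), (0, 100)]
H = perspective_matrix(src, dst)
p = H @ np.array([10.0, 10.0, 1.0])
print(p[:2] / p[2])  # ~ (0, 0): the first corner maps to the first dst corner&lt;/code&gt;&lt;/pre&gt;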
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Apply the perspective transform using the computed matrix.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1706066513456&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;new_image = cv2.warpPerspective(image, trans_mat, (w, h))&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Below are the images before and after warping.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1886&quot; data-origin-height=&quot;692&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bxD1Tg/btsDTsxyhvI/DUKYOX3gnWsqxzxnCBJukk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bxD1Tg/btsDTsxyhvI/DUKYOX3gnWsqxzxnCBJukk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bxD1Tg/btsDTsxyhvI/DUKYOX3gnWsqxzxnCBJukk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbxD1Tg%2FbtsDTsxyhvI%2FDUKYOX3gnWsqxzxnCBJukk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1886&quot; height=&quot;692&quot; data-origin-width=&quot;1886&quot; data-origin-height=&quot;692&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>OpenCV</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/157</guid>
      <comments>https://ga02-ailab.tistory.com/157#entry157comment</comments>
      <pubDate>Wed, 24 Jan 2024 14:34:07 +0900</pubDate>
    </item>
    <item>
      <title>[10] CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification</title>
      <link>https://ga02-ailab.tistory.com/156</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Paper] &lt;/span&gt;&lt;a href=&quot;https://openaccess.thecvf.com//content/ICCV2021/papers/Chen_CrossViT_Cross-Attention_Multi-Scale_Vision_Transformer_for_Image_Classification_ICCV_2021_paper.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://openaccess.thecvf.com//content/ICCV2021/papers/Chen_CrossViT_Cross-Attention_Multi-Scale_Vision_Transformer_for_Image_Classification_ICCV_2021_paper.pdf&lt;/a&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Github] &lt;a href=&quot;https://github.com/IBM/CrossViT&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://github.com/IBM/CrossViT&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1703127801211&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;GitHub - IBM/CrossViT: Official implementation of CrossViT. https://arxiv.org/abs/2103.14899&quot; data-og-description=&quot;Official implementation of CrossViT. https://arxiv.org/abs/2103.14899 - GitHub - IBM/CrossViT: Official implementation of CrossViT. https://arxiv.org/abs/2103.14899&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/IBM/CrossViT&quot; data-og-url=&quot;https://github.com/IBM/CrossViT&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/MT4h3/hyUPNxFQX9/GXinfYqe6EZvzszJLvSxj0/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600&quot;&gt;&lt;a href=&quot;https://github.com/IBM/CrossViT&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/IBM/CrossViT&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/MT4h3/hyUPNxFQX9/GXinfYqe6EZvzszJLvSxj0/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GitHub - IBM/CrossViT: Official implementation of CrossViT. https://arxiv.org/abs/2103.14899&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Official implementation of CrossViT. https://arxiv.org/abs/2103.14899 - GitHub - IBM/CrossViT: Official implementation of CrossViT. https://arxiv.org/abs/2103.14899&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Abstract&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Recently, ViT has been showing better results than CNNs in image classification. This paper proposes a way for transformer models to learn multi-scale features: a dual-branch transformer that combines image patches of different sizes, using cross-attention so that the two sets of patches complement each other's information. The authors report high accuracy while computational cost and memory complexity grow only linearly.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. Introduction&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;With the great success of transformers in NLP, transformers have also emerged as a strong competitor to CNNs in vision. Most earlier work focused on combining self-attention with CNNs, which limits computational scalability. This line of research led to ViT (Vision Transformer), which however has the drawback of requiring a very large training dataset. Efforts to apply transformers to vision have continued since.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This paper therefore proposes applying &quot;multi-scale feature representations for image classification&quot; to the transformer. The image is split into large and small patches, and two branches produce stronger visual features. The two branches have different computational complexities and are fused so that they complement each other, with cross-attention as the fusion mechanism; this increases runtime only linearly rather than quadratically. The figure below compares the accuracy and FLOPs of DeiT, ViT, and the proposed CrossViT.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;731&quot; data-origin-height=&quot;530&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/c0qbsL/btsCnIizyJA/3lPEBn5RzFOHIzYWh4jdIK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/c0qbsL/btsCnIizyJA/3lPEBn5RzFOHIzYWh4jdIK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/c0qbsL/btsCnIizyJA/3lPEBn5RzFOHIzYWh4jdIK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fc0qbsL%2FbtsCnIizyJA%2F3lPEBn5RzFOHIzYWh4jdIK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;436&quot; height=&quot;316&quot; data-origin-width=&quot;731&quot; data-origin-height=&quot;530&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Method&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The proposed CrossViT basically follows the structure of ViT.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.1 Overview of Vision Transformer&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;For an explanation of the Vision Transformer, see the post below!&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;a href=&quot;https://ga02-ailab.tistory.com/147&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://ga02-ailab.tistory.com/147&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2.2 Proposed Multi-Scale Vision Transformer&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;850&quot; data-origin-height=&quot;1104&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dKUTrF/btsCq9ltmWk/AelSw7gTkBQvM5XM6vkqS1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dKUTrF/btsCq9ltmWk/AelSw7gTkBQvM5XM6vkqS1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dKUTrF/btsCq9ltmWk/AelSw7gTkBQvM5XM6vkqS1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdKUTrF%2FbtsCq9ltmWk%2FAelSw7gTkBQvM5XM6vkqS1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;463&quot; height=&quot;601&quot; data-origin-width=&quot;850&quot; data-origin-height=&quot;1104&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;The image patch size affects both accuracy and complexity. In ViT, comparing patch sizes 16 and 32, size 16 was about 6% more accurate but used 4x more FLOPs. This paper presents a way to keep a small patch size while balancing the accuracy gain against FLOPs: first, a dual-branch design, and second, a method for effectively fusing the information between the two branches.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Looking at the architecture above: the proposed CrossViT consists of K multi-scale transformer encoders, and each encoder consists of two branches, the L-Branch and the S-Branch.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;L-Branch: uses a coarse-grained patch size, with more encoders and a larger embedding dimension.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;S-Branch: uses a fine-grained patch size, with fewer encoders and a smaller embedding dimension.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;The two branches are fused L times, and at the end the CLS token is used for prediction. A learnable position embedding is also added to each branch's tokens so that positional information can be learned.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.3 Multi-Scale Feature Fusion&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;Now, let's look at efficient feature fusion, which plays a key role in multi-scale feature representations. The paper introduces the four schemes shown in the figure below: (a)-(c) are heuristic approaches, and (d) is the one proposed in this paper.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1765&quot; data-origin-height=&quot;299&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/k5S64/btsCtqHrMzg/0DdUvwb25HSJvY2J22Kvvk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/k5S64/btsCtqHrMzg/0DdUvwb25HSJvY2J22Kvvk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/k5S64/btsCtqHrMzg/0DdUvwb25HSJvY2J22Kvvk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fk5S64%2FbtsCtqHrMzg%2F0DdUvwb25HSJvY2J22Kvvk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1765&quot; height=&quot;299&quot; data-origin-width=&quot;1765&quot; data-origin-height=&quot;299&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In the equations that follow, &lt;i&gt;x^i&lt;/i&gt; denotes the token sequence of branch &lt;i&gt;i&lt;/i&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.3.1 All-Attention Fusion&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;This is scheme (a), the most basic approach: all tokens are concatenated, without regard to their nature, and fused through a self-attention module. It is simple, but computationally expensive.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;697&quot; data-origin-height=&quot;112&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b82krF/btsCtDUZbvN/kr3JyGFMy7knwdZ04rEkb1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b82krF/btsCtDUZbvN/kr3JyGFMy7knwdZ04rEkb1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b82krF/btsCtDUZbvN/kr3JyGFMy7knwdZ04rEkb1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb82krF%2FbtsCtDUZbvN%2Fkr3JyGFMy7knwdZ04rEkb1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;498&quot; height=&quot;80&quot; data-origin-width=&quot;697&quot; data-origin-height=&quot;112&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;Here &lt;i&gt;f( ), g( )&lt;/i&gt; denote projection and back-projection, and &lt;i&gt;z&lt;/i&gt; is the final output. Interpreting the equations: the L-Branch and S-Branch each produce an output, these are concatenated into &lt;i&gt;y&lt;/i&gt;, and &lt;i&gt;y&lt;/i&gt; is fed to the self-attention module to obtain &lt;i&gt;o&lt;/i&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.3.2 Class Token Fusion&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;This is scheme (b). The CLS token can be regarded as an abstract global feature representation, since it is the only token used in the final prediction step. The two branches can therefore be fused simply by summing their CLS tokens. This is very efficient, since only a single token has to be processed. Once the CLS tokens are fused, this information is propagated back to the patch tokens through the transformer encoder.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;530&quot; data-origin-height=&quot;135&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/baK9Mw/btsCsg7gy4o/rs79ZJhlxA8s02bgCZU3o0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/baK9Mw/btsCsg7gy4o/rs79ZJhlxA8s02bgCZU3o0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/baK9Mw/btsCsg7gy4o/rs79ZJhlxA8s02bgCZU3o0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbaK9Mw%2FbtsCsg7gy4o%2Frs79ZJhlxA8s02bgCZU3o0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;342&quot; height=&quot;87&quot; data-origin-width=&quot;530&quot; data-origin-height=&quot;135&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.3.3 Pairwise Fusion&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;This is scheme (c). Since patch tokens lie in the image plane, they can be combined based on each patch's spatial location. However, the two branches use different patch sizes and therefore have different numbers of patches, so the feature maps are first interpolated to the same spatial size and then fused.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;753&quot; data-origin-height=&quot;137&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/uMaxo/btsCu0CspqG/lftrKi9bEbpfhdWvEX6HKk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/uMaxo/btsCu0CspqG/lftrKi9bEbpfhdWvEX6HKk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/uMaxo/btsCu0CspqG/lftrKi9bEbpfhdWvEX6HKk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FuMaxo%2FbtsCu0CspqG%2FlftrKi9bEbpfhdWvEX6HKk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;402&quot; height=&quot;73&quot; data-origin-width=&quot;753&quot; data-origin-height=&quot;137&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
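The interpolate-then-fuse step described above can be sketched as follows (a minimal PyTorch illustration with made-up tensor sizes and an additive fusion, not the authors' code):

```python
import torch
import torch.nn.functional as F

def pairwise_fuse(patches_l, patches_s):
    """Fuse patch tokens from two branches by aligning their spatial sizes.

    patches_l: (B, Hl*Wl, C) tokens from the large-patch branch (coarser grid)
    patches_s: (B, Hs*Ws, C) tokens from the small-patch branch (finer grid)
    Returns fused tokens on the small branch's grid.
    """
    B, Nl, C = patches_l.shape
    _, Ns, _ = patches_s.shape
    Hl = Wl = int(Nl ** 0.5)
    Hs = Ws = int(Ns ** 0.5)
    # Reshape tokens back to a 2-D grid, then interpolate the coarse grid
    # up to the fine grid so the two branches match spatially.
    grid_l = patches_l.transpose(1, 2).reshape(B, C, Hl, Wl)
    grid_l = F.interpolate(grid_l, size=(Hs, Ws), mode="bilinear",
                           align_corners=False)
    up_l = grid_l.flatten(2).transpose(1, 2)   # (B, Hs*Ws, C)
    return patches_s + up_l                    # element-wise fusion

tokens_l = torch.randn(2, 49, 192)    # 7x7 grid of large patches
tokens_s = torch.randn(2, 196, 192)   # 14x14 grid of small patches
fused = pairwise_fuse(tokens_l, tokens_s)
print(fused.shape)  # torch.Size([2, 196, 192])
```

Here both branches are given the same channel dimension for simplicity; with different dimensions a projection layer would be needed before adding.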
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span&gt;2.3.4 Cross-Attention Fusion&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;This is the scheme shown in figure (d),&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span&gt; and it is the one this paper proposes! In this fusion, each branch shares its CLS token with the other branch. Concretely, to fuse multi-scale features more effectively, each branch's CLS token acts as an agent that exchanges information with the other branch's image patches and is then projected back: it travels to the neighboring branch, learns from that branch's patch tokens, and returns to its own branch. By exchanging CLS tokens, we can expect the patch tokens of the two branches to exchange information as well. The figure below shows the structure of the cross-attention module.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;362&quot; data-origin-height=&quot;466&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/kynMu/btsCG3ei1UV/dGPkAQYYEEuoidLAW3M041/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/kynMu/btsCG3ei1UV/dGPkAQYYEEuoidLAW3M041/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/kynMu/btsCG3ei1UV/dGPkAQYYEEuoidLAW3M041/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FkynMu%2FbtsCG3ei1UV%2FdGPkAQYYEEuoidLAW3M041%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;252&quot; height=&quot;324&quot; data-origin-width=&quot;362&quot; data-origin-height=&quot;466&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span&gt;For the l-branch, the CLS token gathers patch information from the s-branch by being concatenated with its patch tokens. This is expressed as follows.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;225&quot; data-origin-height=&quot;36&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cARogq/btsCAj3mtt8/smGqKn7Woest1StqOvfGTk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cARogq/btsCAj3mtt8/smGqKn7Woest1StqOvfGTk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cARogq/btsCAj3mtt8/smGqKn7Woest1StqOvfGTk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcARogq%2FbtsCAj3mtt8%2FsmGqKn7Woest1StqOvfGTk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;225&quot; height=&quot;36&quot; data-origin-width=&quot;225&quot; data-origin-height=&quot;36&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;&lt;i&gt;f&lt;/i&gt; denotes a projection function, and&lt;i&gt; x^'l&lt;/i&gt; denotes the result of projecting the small and large branches' CLS tokens and concatenating them. The CLS token is then used as the query, and &lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: left;&quot;&gt;cross-attention is computed over &lt;i&gt;x^'l&lt;/i&gt; and&lt;i&gt; x^l_cls&lt;/i&gt;. This is expressed by the equation below.&lt;/span&gt;&lt;/span&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;407&quot; data-origin-height=&quot;66&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/DkmHI/btsCEHo4fOK/6jk31SntUevqW2HuwQuAw1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/DkmHI/btsCEHo4fOK/6jk31SntUevqW2HuwQuAw1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/DkmHI/btsCEHo4fOK/6jk31SntUevqW2HuwQuAw1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FDkmHI%2FbtsCEHo4fOK%2F6jk31SntUevqW2HuwQuAw1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;339&quot; height=&quot;55&quot; data-origin-width=&quot;407&quot; data-origin-height=&quot;66&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span&gt;The W matrices in the equation above are learnable parameters. &lt;i&gt;C&lt;/i&gt; is the&lt;span style=&quot;background-color: #ffffff; color: #212529; text-align: start;&quot;&gt;&lt;span&gt;&amp;nbsp;&lt;/span&gt;embedding dimension, and &lt;i&gt;h&lt;/i&gt; is the &lt;span style=&quot;background-color: #ffffff; color: #212529; text-align: start;&quot;&gt;number of heads.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span&gt;&lt;span style=&quot;background-color: #ffffff; color: #212529; text-align: start;&quot;&gt;&lt;span style=&quot;background-color: #ffffff; color: #212529; text-align: start;&quot;&gt;The rest of the process uses multiple heads and LayerNorm, similarly to self-attention.&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;416&quot; data-origin-height=&quot;74&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/7HgGy/btsCzTw2dQi/YpVXIG4qafFMjVu2zawk50/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/7HgGy/btsCzTw2dQi/YpVXIG4qafFMjVu2zawk50/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/7HgGy/btsCzTw2dQi/YpVXIG4qafFMjVu2zawk50/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F7HgGy%2FbtsCzTw2dQi%2FYpVXIG4qafFMjVu2zawk50%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;416&quot; height=&quot;74&quot; data-origin-width=&quot;416&quot; data-origin-height=&quot;74&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;Here,&lt;i&gt; f &lt;/i&gt;and &lt;i&gt;g&lt;/i&gt; are the &lt;span style=&quot;background-color: #ffffff; color: #212529; text-align: start;&quot;&gt;projection and back-projection functions, respectively.&lt;/span&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
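Putting the equations above together, a simplified single-head version of the cross-attention fusion might look like the sketch below (layer names, dimensions, and the single-head simplification are illustrative assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """The l-branch CLS token attends to the s-branch patch tokens.

    A single-head sketch: f projects the l-CLS into the s-branch space,
    g back-projects the attended result into the l-branch space.
    """
    def __init__(self, dim_l, dim_s):
        super().__init__()
        self.f = nn.Linear(dim_l, dim_s)   # projection
        self.g = nn.Linear(dim_s, dim_l)   # back-projection
        self.wq = nn.Linear(dim_s, dim_s)
        self.wk = nn.Linear(dim_s, dim_s)
        self.wv = nn.Linear(dim_s, dim_s)
        self.norm = nn.LayerNorm(dim_s)

    def forward(self, x_l, x_s):
        # x_l: (B, 1+Nl, Cl) l-branch tokens, CLS first
        # x_s: (B, 1+Ns, Cs) s-branch tokens, CLS first
        cls_l = self.f(x_l[:, :1])                      # projected l-CLS
        x_cat = torch.cat([cls_l, x_s[:, 1:]], dim=1)   # x'^l: CLS + s-patches
        q = self.wq(self.norm(cls_l))
        k = self.wk(self.norm(x_cat))
        v = self.wv(self.norm(x_cat))
        attn = (q @ k.transpose(-2, -1)) / (k.shape[-1] ** 0.5)
        out = attn.softmax(dim=-1) @ v                  # (B, 1, Cs)
        new_cls = x_l[:, :1] + self.g(out)              # residual + back-project
        return torch.cat([new_cls, x_l[:, 1:]], dim=1)  # patches unchanged

fuse = CrossAttentionFusion(dim_l=384, dim_s=192)
x_l = torch.randn(2, 1 + 49, 384)
x_s = torch.randn(2, 1 + 196, 192)
out = fuse(x_l, x_s)
print(out.shape)  # torch.Size([2, 50, 384])
```

Note that only the CLS token is updated; the l-branch patch tokens pass through untouched, which is what keeps this fusion cheap compared to full all-to-all attention.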
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;color: #000000; background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: start;&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt;3. Experiments&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt;3.1 Comparisons with DeiT&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;845&quot; data-origin-height=&quot;600&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/VsXgz/btsCuPadFUC/1lgu55VaglvZKw42IrkyMk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/VsXgz/btsCuPadFUC/1lgu55VaglvZKw42IrkyMk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/VsXgz/btsCuPadFUC/1lgu55VaglvZKw42IrkyMk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FVsXgz%2FbtsCuPadFUC%2F1lgu55VaglvZKw42IrkyMk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;470&quot; height=&quot;334&quot; data-origin-width=&quot;845&quot; data-origin-height=&quot;600&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;First, the comparison with DeiT. CrossViT achieves higher accuracy.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;3.2 Comparisons with SOTA Transformers&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;845&quot; data-origin-height=&quot;591&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b9NaPW/btsCuN4vg3R/vjOF7BFgbVdOiLBRdFHKUk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b9NaPW/btsCuN4vg3R/vjOF7BFgbVdOiLBRdFHKUk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b9NaPW/btsCuN4vg3R/vjOF7BFgbVdOiLBRdFHKUk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb9NaPW%2FbtsCuN4vg3R%2FvjOF7BFgbVdOiLBRdFHKUk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;473&quot; height=&quot;331&quot; data-origin-width=&quot;845&quot; data-origin-height=&quot;591&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In the comparison with other transformer models as well, CrossViT achieves the highest performance.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start; background-color: #ffc1c8;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;3.3 Comparisons with CNN-based Models&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;845&quot; data-origin-height=&quot;936&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bmu3Es/btsCwKM2Xfv/dKeaxhJCsRWhy5R1OvPt3K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bmu3Es/btsCwKM2Xfv/dKeaxhJCsRWhy5R1OvPt3K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bmu3Es/btsCwKM2Xfv/dKeaxhJCsRWhy5R1OvPt3K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbmu3Es%2FbtsCwKM2Xfv%2FdKeaxhJCsRWhy5R1OvPt3K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;532&quot; height=&quot;589&quot; data-origin-width=&quot;845&quot; data-origin-height=&quot;936&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This time, the comparison with CNN-based models. CrossViT performs roughly on par with the CNNs.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: left;&quot; data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #ffc1c8;&quot;&gt;3.4 Ablation Study&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3.4.1 Comparison of Different Fusion Schemes&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;845&quot; data-origin-height=&quot;258&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bOlqwV/btsCyyEIkp5/qDu1OR3vBCU2ZRQS7Viw8K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bOlqwV/btsCyyEIkp5/qDu1OR3vBCU2ZRQS7Viw8K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bOlqwV/btsCyyEIkp5/qDu1OR3vBCU2ZRQS7Viw8K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbOlqwV%2FbtsCyyEIkp5%2FqDu1OR3vBCU2ZRQS7Viw8K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;531&quot; height=&quot;162&quot; data-origin-width=&quot;845&quot; data-origin-height=&quot;258&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Performance of the various fusion schemes. The cross-attention method achieves the highest accuracy.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper Review/etc</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/156</guid>
      <comments>https://ga02-ailab.tistory.com/156#entry156comment</comments>
      <pubDate>Tue, 26 Dec 2023 21:30:22 +0900</pubDate>
    </item>
    <item>
      <title>[9] Supervised Contrastive Learning</title>
      <link>https://ga02-ailab.tistory.com/155</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Paper] &lt;a href=&quot;https://arxiv.org/pdf/2004.11362.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://arxiv.org/pdf/2004.11362.pdf&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Github] &lt;a href=&quot;https://github.com/HobbitLong/SupContrast&quot; target=&quot;_blank&quot; rel=&quot;noopener&amp;nbsp;noreferrer&quot;&gt;https://github.com/HobbitLong/SupContrast&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1699494669822&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;GitHub - HobbitLong/SupContrast: PyTorch implementation of &amp;quot;Supervised Contrastive Learning&amp;quot;  (and SimCLR incidentally)&quot; data-og-description=&quot;PyTorch implementation of &amp;quot;Supervised Contrastive Learning&amp;quot; (and SimCLR incidentally) - GitHub - HobbitLong/SupContrast: PyTorch implementation of &amp;quot;Supervised Contrastive Learning&amp;amp;q...&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/HobbitLong/SupContrast&quot; data-og-url=&quot;https://github.com/HobbitLong/SupContrast&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/bNdIqf/hyUrAGfxkT/8AKESigcasrsiOLE0o6hgK/img.png?width=1200&amp;amp;height=600&amp;amp;face=978_174_1030_232&quot;&gt;&lt;a href=&quot;https://github.com/HobbitLong/SupContrast&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/HobbitLong/SupContrast&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/bNdIqf/hyUrAGfxkT/8AKESigcasrsiOLE0o6hgK/img.png?width=1200&amp;amp;height=600&amp;amp;face=978_174_1030_232');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GitHub - HobbitLong/SupContrast: PyTorch implementation of &quot;Supervised Contrastive Learning&quot; (and SimCLR incidentally)&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;PyTorch implementation of &quot;Supervised Contrastive Learning&quot; (and SimCLR incidentally) - GitHub - HobbitLong/SupContrast: PyTorch implementation of &quot;Supervised Contrastive Learning&amp;amp;q...&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. Introduction&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Until now, the loss function most widely used in deep-learning classification models has been cross-entropy. However, cross-entropy lacks robustness to noisy labels and does not allow a margin to be added during training, which can degrade performance. Various new loss functions have been proposed to address this, but they still do not work well on large-scale datasets, which remains a problem.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Self-supervised contrastive learning emerged to overcome these shortcomings.&amp;nbsp; This paper proposes supervised contrastive learning; let us briefly explain the difference between the two with the figure below.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2104&quot; data-origin-height=&quot;952&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bFxp61/btszZI5uc7l/5AWVdRcjBbiGE4GawuEKQ0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bFxp61/btszZI5uc7l/5AWVdRcjBbiGE4GawuEKQ0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bFxp61/btszZI5uc7l/5AWVdRcjBbiGE4GawuEKQ0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbFxp61%2FbtszZI5uc7l%2F5AWVdRcjBbiGE4GawuEKQ0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2104&quot; height=&quot;952&quot; data-origin-width=&quot;2104&quot; data-origin-height=&quot;952&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;First, self-supervised contrastive learning. This method was introduced to train well on large-scale datasets that have no labels.&amp;nbsp; It learns meaningful representations without labels, making it a form of representation learning. The procedure is as follows.&lt;/span&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;step 1) Pick one training sample (the anchor) and apply data augmentation. The augmented samples become the positive data.&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;step 2) Set the remaining images as negative data and train.&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;step 3) Training proceeds as the positives and negatives separate in the embedding vector space.&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;However, this procedure has one problem. Because the criterion for a negative is &quot;a different image&quot; rather than &quot;a different class&quot;, images of the same class also end up classified as negatives. This makes fine-tuning after pretraining, and further training in general, difficult. The problem arises precisely because the method is self-supervised and has no labels. &lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;To solve this, this paper proposes &lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: center;&quot;&gt;supervised contrastive learning&lt;span&gt;. Since the ground-truth labels are available, the problems above can be resolved. The data augmentation of step 1 is kept, and samples with the same label are trained to obtain similar representations: positives are &quot;pulled together&quot; while negatives are &quot;pushed apart&quot;. Also, whereas cross-entropy learns the representation and the decision boundary simultaneously, this paper learns them separately. The authors name this whole procedure SupCon. Let us now look at SupCon's method in detail.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9; color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;2. Method&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8; color: #000000;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;span style=&quot;text-align: center;&quot;&gt;2.1 Representation Learning Framework&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;- Augmentation Module (&lt;i&gt;Aug(x)&lt;/i&gt;):&amp;nbsp; augments the data, allowing diverse patterns of the data to be reflected.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;- Encoder Module (&lt;i&gt;Enc(x)&lt;/i&gt;): extracts features by applying a CNN.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;- Projection head Module (&lt;i&gt;proj(x)&lt;/i&gt;): an MLP that extracts an L2-normalized feature.&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
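The three modules above can be sketched roughly as follows (the tiny stand-in encoder and all sizes are placeholders, not the paper's configuration; the augmentation module is omitted since any image-augmentation pipeline fits there):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupConModel(nn.Module):
    """Enc(x) followed by proj(x): backbone features -> L2-normalized embedding."""
    def __init__(self, feat_dim=128):
        super().__init__()
        # Enc(x): any CNN backbone; a minimal stand-in here.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # proj(x): an MLP projection head.
        self.proj = nn.Sequential(
            nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, feat_dim),
        )

    def forward(self, x):
        z = self.proj(self.encoder(x))
        return F.normalize(z, dim=1)   # L2-normalize the projected feature

model = SupConModel()
z = model(torch.randn(4, 3, 32, 32))
print(z.shape)  # torch.Size([4, 128])
```

The L2 normalization at the end is what lets the contrastive loss below use plain dot products as cosine similarities.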
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;2.2 Contrastive Loss Function&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;First, let us look at the &lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;self-supervised contrastive&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt; loss function. The paper uses the term &amp;ldquo;multiviewed batch&amp;rdquo;, which refers to the 2N augmented samples.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;2.2.1 Self-Supervised Contrastive Loss&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1182&quot; data-origin-height=&quot;188&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/BDfoc/btsz5o6ea8c/3NiIiMueX6WeumrgAURQZ0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/BDfoc/btsz5o6ea8c/3NiIiMueX6WeumrgAURQZ0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/BDfoc/btsz5o6ea8c/3NiIiMueX6WeumrgAURQZ0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FBDfoc%2Fbtsz5o6ea8c%2F3NiIiMueX6WeumrgAURQZ0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;747&quot; height=&quot;119&quot; data-origin-width=&quot;1182&quot; data-origin-height=&quot;188&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;Each symbol means the following.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1403&quot; data-origin-height=&quot;383&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/eF7kLx/btsAM6dvUwK/VrEHsVnBGkS0J2zJvNVqj0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/eF7kLx/btsAM6dvUwK/VrEHsVnBGkS0J2zJvNVqj0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/eF7kLx/btsAM6dvUwK/VrEHsVnBGkS0J2zJvNVqj0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FeF7kLx%2FbtsAM6dvUwK%2FVrEHsVnBGkS0J2zJvNVqj0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1403&quot; height=&quot;383&quot; data-origin-width=&quot;1403&quot; data-origin-height=&quot;383&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;During training, the numerator of the equation above is maximized while the denominator is minimized.&lt;/span&gt;&lt;/p&gt;
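As a concrete sketch, the self-supervised loss above can be computed as follows, assuming the 2N embeddings are already L2-normalized and arranged so that z[i] and z[i+N] are the two augmented views of the same sample (this layout is an assumption made for illustration):

```python
import torch
import torch.nn.functional as F

def self_supervised_contrastive_loss(z, temperature=0.1):
    """InfoNCE-style loss over a multiviewed batch of 2N embeddings.

    z: (2N, D) L2-normalized embeddings; z[i] and z[i+N] are positives.
    For each anchor, the numerator uses its single positive pair and the
    denominator sums over every other sample in the batch.
    """
    n2 = z.shape[0]
    n = n2 // 2
    sim = z @ z.t() / temperature              # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))          # exclude each anchor itself
    pos = torch.arange(n2).roll(n)             # index of each anchor's positive
    # cross-entropy with the positive as the target class implements
    # -log( exp(sim_pos) / sum_k exp(sim_k) ), averaged over anchors
    return F.cross_entropy(sim, pos)

z = F.normalize(torch.randn(8, 128), dim=1)    # N=4 samples, 2 views each
loss = self_supervised_contrastive_loss(z)
print(loss.item() > 0)  # True
```

Maximizing the numerator and minimizing the denominator, as described above, is exactly what minimizing this cross-entropy does.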
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;2.2.2 &lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: center;&quot;&gt;Supervised Contrastive Loss&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1226&quot; data-origin-height=&quot;197&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dAewZj/btsAOMr9qVb/LmFZi5Gz0XNKshYLvytxS0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dAewZj/btsAOMr9qVb/LmFZi5Gz0XNKshYLvytxS0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dAewZj/btsAOMr9qVb/LmFZi5Gz0XNKshYLvytxS0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdAewZj%2FbtsAOMr9qVb%2FLmFZi5Gz0XNKshYLvytxS0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1226&quot; height=&quot;197&quot; data-origin-width=&quot;1226&quot; data-origin-height=&quot;197&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Compared with the self-supervised contrastive loss, what has changed is the part in the blue box. Each symbol is defined below.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1407&quot; data-origin-height=&quot;419&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bqR49w/btsAPic33cA/NRIHbmZq9QIGhMR1xNkSN1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bqR49w/btsAPic33cA/NRIHbmZq9QIGhMR1xNkSN1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bqR49w/btsAPic33cA/NRIHbmZq9QIGhMR1xNkSN1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbqR49w%2FbtsAPic33cA%2FNRIHbmZq9QIGhMR1xNkSN1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1407&quot; height=&quot;419&quot; data-origin-width=&quot;1407&quot; data-origin-height=&quot;419&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
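The blue-box change — averaging the log-probability over the full positive set P(i) — can be sketched like this; the function and variable names are illustrative:

```python
import numpy as np

def supcon_loss(Z, labels, tau=0.1):
    """Supervised contrastive (SupCon) loss sketch. Unlike the
    self-supervised loss, each anchor i averages over ALL positives
    P(i): the other samples that share anchor i's label."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # l2-normalize embeddings
    S = (Z @ Z.T) / tau                                # temperature-scaled similarities
    n, total = len(labels), 0.0
    for i in range(n):
        pos = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not pos:
            continue                                   # anchor with no positives
        denom = sum(np.exp(S[i, a]) for a in range(n) if a != i)
        total += -sum(np.log(np.exp(S[i, p]) / denom) for p in pos) / len(pos)
    return total / n

# two tight clusters in 2-D
Z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
good = supcon_loss(Z, [0, 0, 1, 1])  # labels follow the clusters
bad  = supcon_loss(Z, [0, 1, 0, 1])  # labels cut across the clusters
print(good < bad)  # True
```

When labels agree with the geometry, every positive pair is already similar, so the averaged log-probabilities are high and the loss is low.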
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: center;&quot;&gt;2.2.3 Connection to Triplet Loss and N-pairs Loss&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Supervised contrastive learning is closely related to the triplet loss: the contrastive loss reduces to the triplet loss when exactly one positive and one negative are used, and becomes equivalent to the N-pairs loss when more than one negative is used.&lt;/span&gt;&lt;/p&gt;
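The one-positive/one-negative connection can be checked with a tiny sketch (both written on inner-product similarities for comparability; the usual triplet loss is stated with distances):

```python
import numpy as np

def contrastive_1pos_1neg(a, p, n, tau=1.0):
    """Contrastive loss with exactly one positive and one negative.
    -log softmax over the two similarities simplifies to
    log(1 + exp((s_neg - s_pos)/tau)): a smooth (softplus) version of
    the hinge in the triplet loss below."""
    return np.log1p(np.exp(((a @ n) - (a @ p)) / tau))

def triplet_loss(a, p, n, margin=0.5):
    """Triplet loss on similarities: penalize when the negative is not
    at least `margin` less similar than the positive."""
    return max(0.0, (a @ n) - (a @ p) + margin)

a, p, n = np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])
lo_c, hi_c = contrastive_1pos_1neg(a, p, n), contrastive_1pos_1neg(a, n, p)
lo_t, hi_t = triplet_loss(a, p, n), triplet_loss(a, n, p)
print(lo_c < hi_c, lo_t < hi_t)  # True True: both losses drop as the positive gets closer
```

Both objectives decrease monotonically in (s_pos − s_neg), which is why one can be seen as a smoothed form of the other.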
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3. Experiments&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Compared against a variety of other loss functions, the proposed loss shows the strongest performance.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;844&quot; data-origin-height=&quot;593&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/YhKjf/btsAMkpvwyN/rushxsT8bA5cD19fK0X0kK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/YhKjf/btsAMkpvwyN/rushxsT8bA5cD19fK0X0kK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/YhKjf/btsAMkpvwyN/rushxsT8bA5cD19fK0X0kK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FYhKjf%2FbtsAMkpvwyN%2FrushxsT8bA5cD19fK0X0kK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;697&quot; height=&quot;490&quot; data-origin-width=&quot;844&quot; data-origin-height=&quot;593&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper Review/etc</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/155</guid>
      <comments>https://ga02-ailab.tistory.com/155#entry155comment</comments>
      <pubDate>Thu, 23 Nov 2023 21:51:36 +0900</pubDate>
    </item>
    <item>
      <title>RuntimeError: Unable to find a valid cuDNN algorithm to run convolution</title>
      <link>https://ga02-ailab.tistory.com/154</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #c1bef9;&quot;&gt;- Full error message&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1700393371200&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;RuntimeError: Unable to find a valid cuDNN algorithm to run convolution&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;-&amp;nbsp; Solution&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This error is caused by running out of GPU memory. Reducing the batch size fixes it!&lt;/span&gt;&lt;/p&gt;
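The fallback can even be automated with a framework-agnostic retry loop; `train_one_step` / `fake_step` below are placeholders standing in for a real forward/backward pass, not real training code:

```python
def train_with_oom_fallback(train_one_step, batch_size, min_batch_size=1):
    """Retry with a halved batch size whenever the cuDNN/OOM RuntimeError
    is raised; return the first batch size that fits."""
    while batch_size >= min_batch_size:
        try:
            train_one_step(batch_size)
            return batch_size            # this batch size fits in GPU memory
        except RuntimeError:             # "Unable to find a valid cuDNN algorithm ..."
            batch_size //= 2             # halve and retry
    raise RuntimeError("out of memory even at the minimum batch size")

# toy stand-in: pretend only batch sizes <= 16 fit in memory
def fake_step(bs):
    if bs > 16:
        raise RuntimeError("Unable to find a valid cuDNN algorithm to run convolution")

fitted = train_with_oom_fallback(fake_step, 64)
print(fitted)  # 16, after falling back 64 -> 32 -> 16
```

In practice you would pass your actual training step and rebuild the DataLoader with the returned batch size.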
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Error Note</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/154</guid>
      <comments>https://ga02-ailab.tistory.com/154#entry154comment</comments>
      <pubDate>Sun, 19 Nov 2023 20:29:51 +0900</pubDate>
    </item>
    <item>
      <title>TypeError: only integer scalar arrays can be converted to a scalar index</title>
      <link>https://ga02-ailab.tistory.com/153</link>
      <description>&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;color: #000000; background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;- Full error message&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;TypeError: only integer scalar arrays can be converted to a scalar index&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;This error occurs when concatenating several numpy arrays.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;It is raised because the matrices were not passed to &lt;b&gt;concatenate()&lt;/b&gt; as a tuple.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Solution&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1699334948300&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;np.concatenate((y_h, cb_h, cr_h))&lt;/code&gt;&lt;/pre&gt;
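A minimal reproduction of both the error and the fix, assuming three small arrays in place of the original `y_h`/`cb_h`/`cr_h`:

```python
import numpy as np

y_h, cb_h, cr_h = np.zeros((2, 3)), np.ones((2, 3)), np.full((2, 3), 2.0)

# Correct: the arrays are wrapped in a single tuple argument.
out = np.concatenate((y_h, cb_h, cr_h))
print(out.shape)  # (6, 3)

# Incorrect: each array is a separate positional argument, so cb_h is
# interpreted as the `axis` parameter and numpy raises the TypeError.
try:
    np.concatenate(y_h, cb_h, cr_h)
    err_name = None
except TypeError as e:
    err_name = type(e).__name__
print(err_name)  # TypeError
```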
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Simply wrap the arrays you want to concatenate in a tuple.&lt;/span&gt;&lt;/p&gt;</description>
      <category>Error Note</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/153</guid>
      <comments>https://ga02-ailab.tistory.com/153#entry153comment</comments>
      <pubDate>Tue, 7 Nov 2023 16:30:42 +0900</pubDate>
    </item>
    <item>
      <title>[2] Thinking in Frequency: Face Forgery Detection by Mining Frequency-aware Clues</title>
      <link>https://ga02-ailab.tistory.com/152</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;[Paper] &lt;a href=&quot;https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123570086.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123570086.pdf&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;[Github] &lt;a href=&quot;https://github.com/yyk-wew/F3Net&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://github.com/yyk-wew/F3Net&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1697420780614&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;GitHub - yyk-wew/F3Net: Pytorch implementation of F3Net (ECCV 2020 F3Net: Frequency in Face Forgery Network)&quot; data-og-description=&quot;Pytorch implementation of F3Net (ECCV 2020 F3Net: Frequency in Face Forgery Network) - GitHub - yyk-wew/F3Net: Pytorch implementation of F3Net (ECCV 2020 F3Net: Frequency in Face Forgery Network)&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/yyk-wew/F3Net&quot; data-og-url=&quot;https://github.com/yyk-wew/F3Net&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/busmNf/hyUdOxjvv1/7XrGuaHL5nPligiaSMqPnK/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600&quot;&gt;&lt;a href=&quot;https://github.com/yyk-wew/F3Net&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/yyk-wew/F3Net&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/busmNf/hyUdOxjvv1/7XrGuaHL5nPligiaSMqPnK/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GitHub - yyk-wew/F3Net: Pytorch implementation of F3Net (ECCV 2020 F3Net: Frequency in Face Forgery Network)&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Pytorch implementation of F3Net (ECCV 2020 F3Net: Frequency in Face Forgery Network) - GitHub - yyk-wew/F3Net: Pytorch implementation of F3Net (ECCV 2020 F3Net: Frequency in Face Forgery Network)&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;color: #000000; background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Abstract&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;With deepfakes becoming ever more sophisticated and their malicious use on the rise, this paper proposes a method that detects deepfake images using the image's DCT frequency information. Two kinds of frequency clues are mined from each image:&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- frequency-aware decomposed image components&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- local frequency statistics&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;By collaboratively learning these two frequency patterns in a two-stream network, deepfake images can be classified more accurately!&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;1. Introduction&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Early work on deepfake detection relied on hand-crafted features such as local pattern analysis, noise variance evaluation, and steganalysis features; recent work uses CNNs.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;However, with recent advances in generative models, telling forged images apart has become much harder, as Figure 1(a) below shows. (HQ: high quality, LQ: low quality)&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2120&quot; data-origin-height=&quot;582&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bGxvm7/btsyDv6G31i/U6IfK8M8JkkQsHpMtArNRK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bGxvm7/btsyDv6G31i/U6IfK8M8JkkQsHpMtArNRK/img.png&quot; data-alt=&quot;그림 1&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bGxvm7/btsyDv6G31i/U6IfK8M8JkkQsHpMtArNRK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbGxvm7%2FbtsyDv6G31i%2FU6IfK8M8JkkQsHpMtArNRK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2120&quot; height=&quot;582&quot; data-origin-width=&quot;2120&quot; data-origin-height=&quot;582&quot;/&gt;&lt;/span&gt;&lt;figcaption&gt;그림 1&lt;/figcaption&gt;
&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Moreover, with heavily compressed formats such as JPEG or H.264, image quality degrades and forgery clues become even harder to spot. This, however, can be addressed by exploiting the image's frequency information.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;How, then, can frequency information be fed into a CNN most effectively? Conventional FFT or DCT spectra do not match the shift-invariance and local consistency that CNNs rely on, so a vanilla CNN struggles to learn from them. The paper therefore proposes the following two distinctive kinds of frequency information.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;- FAD (Frequency-aware Decomposition)&lt;/b&gt;: since an image can be separated by frequency, real and fake images exhibit characteristic frequency patterns, as shown in Figure 1(b). This works effectively with CNNs and is very powerful!&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;- LFS (Local Frequency Statistics)&lt;/b&gt;: uses the image's local frequency statistics. Figure 1(b) again shows a clear statistical gap between real and fake.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Because these two kinds of information are complementary, they can be fused through a cross-attention module (MixBlock).&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The paper therefore proposes &lt;b&gt;F3-Net (Frequency in Face Forgery Network)&lt;/b&gt;, a model built on these components.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2. Our Approach&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Below is the overall architecture of F3-Net.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1984&quot; data-origin-height=&quot;654&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/wBE8O/btsyuiOtrBB/zto9zHtjVyHaELotJ6V210/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/wBE8O/btsyuiOtrBB/zto9zHtjVyHaELotJ6V210/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/wBE8O/btsyuiOtrBB/zto9zHtjVyHaELotJ6V210/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FwBE8O%2FbtsyuiOtrBB%2Fzto9zHtjVyHaELotJ6V210%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1984&quot; height=&quot;654&quot; data-origin-width=&quot;1984&quot; data-origin-height=&quot;654&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.1 FAD: Frequency-Aware Decomposition&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Below is the architecture of FAD.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;2338&quot; data-origin-height=&quot;536&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bjuBei/btsyDAtoCY1/Okarl3IbLYRe83dCGgfKd1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bjuBei/btsyDAtoCY1/Okarl3IbLYRe83dCGgfKd1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bjuBei/btsyDAtoCY1/Okarl3IbLYRe83dCGgfKd1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbjuBei%2FbtsyDAtoCY1%2FOkarl3IbLYRe83dCGgfKd1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;2338&quot; height=&quot;536&quot; data-origin-width=&quot;2338&quot; data-origin-height=&quot;536&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The paper proposes FAD (Frequency-Aware Decomposition), which learns frequency filters and splits the image according to those frequency bands. The decomposed frequency components are stacked along the channel axis, and an Xception backbone then extracts the patterns that distinguish fake from real.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In more detail, several frequency filters are used. Some of them are N binary base filters that partition the spectrum into low/middle/high frequency bands, written as:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;121&quot; data-origin-height=&quot;46&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/IMeNQ/btszrJW0jFp/k1mLmBoRWwGuQOI42i8Hw0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/IMeNQ/btszrJW0jFp/k1mLmBoRWwGuQOI42i8Hw0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/IMeNQ/btszrJW0jFp/k1mLmBoRWwGuQOI42i8Hw0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FIMeNQ%2FbtszrJW0jFp%2Fk1mLmBoRWwGuQOI42i8Hw0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;100&quot; height=&quot;38&quot; data-origin-width=&quot;121&quot; data-origin-height=&quot;46&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Three learnable filters are then added, written as:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;99&quot; data-origin-height=&quot;43&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bIeQrs/btszkP5vpQb/dn6UAxDGveCHS0PrwrXOYk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bIeQrs/btszkP5vpQb/dn6UAxDGveCHS0PrwrXOYk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bIeQrs/btszkP5vpQb/dn6UAxDGveCHS0PrwrXOYk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbIeQrs%2FbtszkP5vpQb%2Fdn6UAxDGveCHS0PrwrXOYk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;81&quot; height=&quot;35&quot; data-origin-width=&quot;99&quot; data-origin-height=&quot;43&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The two kinds of filters are combined by the equation below.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;353&quot; data-origin-height=&quot;39&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ccL5oz/btszjHmDj28/6sJDpLIySdQAYcDxnxopKk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ccL5oz/btszjHmDj28/6sJDpLIySdQAYcDxnxopKk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ccL5oz/btszjHmDj28/6sJDpLIySdQAYcDxnxopKk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FccL5oz%2FbtszjHmDj28%2F6sJDpLIySdQAYcDxnxopKk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;299&quot; height=&quot;33&quot; data-origin-width=&quot;353&quot; data-origin-height=&quot;39&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Here, &amp;sigma; corresponds to:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;228&quot; data-origin-height=&quot;47&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bQxQz6/btszsCDyJ4S/wGxRRNk4RtBinfz3fTsYs1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bQxQz6/btszsCDyJ4S/wGxRRNk4RtBinfz3fTsYs1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bQxQz6/btszsCDyJ4S/wGxRRNk4RtBinfz3fTsYs1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbQxQz6%2FbtszsCDyJ4S%2FwGxRRNk4RtBinfz3fTsYs1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;175&quot; height=&quot;36&quot; data-origin-width=&quot;228&quot; data-origin-height=&quot;47&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The goal is to squash these values into the range between -1 and 1. The decomposed image components of an input image &lt;i&gt;X&lt;/i&gt; are then obtained as:&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;631&quot; data-origin-height=&quot;48&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dcgPaI/btszj4PDneE/T4yrIKRtZGlVE5LCrR1ark/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dcgPaI/btszj4PDneE/T4yrIKRtZGlVE5LCrR1ark/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dcgPaI/btszj4PDneE/T4yrIKRtZGlVE5LCrR1ark/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdcgPaI%2Fbtszj4PDneE%2FT4yrIKRtZGlVE5LCrR1ark%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;539&quot; height=&quot;41&quot; data-origin-width=&quot;631&quot; data-origin-height=&quot;48&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;⊙ denotes the element-wise product and &lt;i&gt;D&lt;/i&gt; is the Discrete Cosine Transform (DCT). In the DCT spectrum, low frequencies sit in the top-left corner and high frequencies in the bottom-right. Because DCT is widely used in compression schemes such as JPEG and H.264, a DCT-based FAD can represent fake patterns particularly well. The base filters f_base divide the spectrum from low to high frequency into N bands of equal energy, and as the FAD diagram above shows, N = 3 bands are used along with the learnable filters f_w. The low-frequency band covers the first 1/16 of the spectrum, the middle band the range from 1/16 to 1/8, and the high-frequency band the remaining 7/8.&lt;/span&gt;&lt;/p&gt;
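A simplified sketch of this decomposition, using only the binary base filters (no learnable f_w) and a distance-from-the-corner stand-in for the paper's equal-energy band split; all names here are illustrative:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix: D @ x @ D.T is the 2-D DCT of x."""
    k, i = np.arange(n)[:, None], np.arange(n)[None, :]
    D = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    D[0] /= np.sqrt(2.0)
    return D

def fad_decompose(x, bounds=(1 / 16, 1 / 8, 1.0)):
    """FAD-style decomposition: mask the DCT spectrum into low (first
    1/16), middle (1/16 to 1/8) and high (last 7/8) bands, ordered by
    distance from the top-left (low-frequency) corner, then invert each
    masked band back to the spatial domain."""
    n = x.shape[0]
    D = dct_matrix(n)
    X = D @ x @ D.T                                          # 2-D DCT of the image
    r = np.add.outer(np.arange(n), np.arange(n)) / (2 * n - 2)  # 0 (top-left) .. 1
    comps, lo = [], 0.0
    for hi in bounds:
        mask = (r >= lo) & ((r < hi) if hi < 1.0 else (r <= 1.0))  # binary band filter
        comps.append(D.T @ (X * mask) @ D)                   # inverse DCT of the band
        lo = hi
    return comps

x = np.random.default_rng(0).normal(size=(16, 16))
low, mid, high = fad_decompose(x)
# the binary masks partition the spectrum, so the bands sum back to x
print(np.allclose(low + mid + high, x))  # True
```

In the full method these components would be stacked along the channel axis and fed to the backbone; the learnable filters f_w would add a soft, trainable adjustment on top of each binary mask.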
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.2 LFS: Local Frequency Statistics&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;Below is the architecture of LFS.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;991&quot; data-origin-height=&quot;234&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/d0EzCu/btszqfh5fQI/K4T6kXP0eRXu3rCdqV8k31/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/d0EzCu/btszqfh5fQI/K4T6kXP0eRXu3rCdqV8k31/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/d0EzCu/btszqfh5fQI/K4T6kXP0eRXu3rCdqV8k31/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fd0EzCu%2Fbtszqfh5fQI%2FK4T6kXP0eRXu3rCdqV8k31%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;729&quot; height=&quot;172&quot; data-origin-width=&quot;991&quot; data-origin-height=&quot;234&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;FAD, described above, learns to provide information a CNN can use, but because that information must first be transformed back to the spatial domain, FAD cannot exploit the frequency information directly. Moreover, hand-crafting CNN-style features for spotting fakes directly from the DCT spectrum is difficult, so the paper proposes LFS (Local Frequency Statistics), which renders frequency statistics while preserving shift-invariance and local consistency. These statistics are again fed to an Xception backbone to detect fake patterns.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;As diagram (a) above shows, a Sliding Window DCT (SWDCT) is first applied to the input image to extract local frequency responses. Next, the mean frequency response is computed within each learnable frequency band. These frequency statistics are re-assembled into a multi-channel spatial map that shares the same layout as the input image. In addition, LFS uses a localized aperture (the sliding window) to detect abnormal frequencies; the authors note that this yields a smoother distribution within a set of frequency bands, free from interference by outliers.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Let us now express this process as equations.&lt;br /&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;790&quot; data-origin-height=&quot;47&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bIvHU8/btszlvfdGyO/VQjCLNz49u3LQuKGDNQuF1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bIvHU8/btszlvfdGyO/VQjCLNz49u3LQuKGDNQuF1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bIvHU8/btszlvfdGyO/VQjCLNz49u3LQuKGDNQuF1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbIvHU8%2FbtszlvfdGyO%2FVQjCLNz49u3LQuKGDNQuF1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;672&quot; height=&quot;40&quot; data-origin-width=&quot;790&quot; data-origin-height=&quot;47&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;After the DCT of each window p &amp;isin; x, local statistics are gathered in each frequency band, similarly to FAD. In the equation above, log10 is applied to balance the magnitudes across the frequency bands. The bands are obtained by splitting the spectrum evenly into M parts, from low to high frequency, before the features are collected. As in FAD, &lt;i&gt;h_base&lt;/i&gt; is the base filter and &lt;i&gt;h_w&lt;/i&gt; is a learnable filter. The local frequency statistics q for window p are then transposed into a 1*1*M vector. Collecting and vectorizing these statistics over all windows produces a map with M channels, which is then fed into a CNN (Xception in this paper). The authors set the window size to 10, the stride to 2, and the number of bands M to 6.&lt;/span&gt;&lt;/p&gt;
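The SWDCT-plus-statistics pipeline can be sketched as follows. This is a plain-Python illustration, not the paper's implementation: `dct2` is a naive unnormalized 2D DCT-II written here for self-containment, the anti-diagonal band assignment is my assumption, and the +1 inside the log is an illustrative guard against log 0.

```python
import math

def dct2(patch):
    """Naive 2D DCT-II of a square patch (no normalization)."""
    n = len(patch)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (patch[x][y]
                          * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                          * math.cos(math.pi * (2 * y + 1) * v / (2 * n)))
            out[u][v] = s
    return out

def lfs_stats(image, window=10, stride=2, m=6):
    """Slide a window over the image; per window, take log10 band averages."""
    stats = []                                   # grid of M-vectors
    for top in range(0, len(image) - window + 1, stride):
        row = []
        for left in range(0, len(image[0]) - window + 1, stride):
            patch = [r[left:left + window] for r in image[top:top + window]]
            d = dct2(patch)
            max_dist = 2 * (window - 1)
            sums, counts = [0.0] * m, [0] * m
            for u in range(window):
                for v in range(window):
                    band = min(m - 1, (u + v) * m // (max_dist + 1))
                    sums[band] += abs(d[u][v])
                    counts[band] += 1
            # log10 balances the magnitudes across bands (+1 avoids log 0)
            row.append([math.log10(1 + s / c) for s, c in zip(sums, counts)])
        stats.append(row)
    return stats
```

Re-assembling `stats` channel-wise gives the M-channel spatial map that the Xception stream consumes.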
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.3 Two-stream Collaborative Learning Framework&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;861&quot; data-origin-height=&quot;301&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bv4lvc/btszlYuJH2I/icQFGsHI701vk9VxiX9zz1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bv4lvc/btszlYuJH2I/icQFGsHI701vk9VxiX9zz1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bv4lvc/btszlYuJH2I/icQFGsHI701vk9VxiX9zz1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbv4lvc%2FbtszlYuJH2I%2FicQFGsHI701vk9VxiX9zz1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;731&quot; height=&quot;256&quot; data-origin-width=&quot;861&quot; data-origin-height=&quot;301&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The paper argues that FAD and LFS are different but complementary, and it therefore proposes a cross-attention-based collaborative learning framework that fuses the two. Both the FAD and LFS streams use Xception as their backbone, and cross-attention is applied between the corresponding blocks for feature interaction and message passing. Previous methods simply concatenated the two kinds of features, but the proposed model first computes cross-attention weights from the feature maps of the FAD and LFS branches; the cross-attention matrix has the advantage of letting the features of the two streams reinforce each other. To use both mid-level and high-level features, the paper fuses the features of the 7th and 12th Xception blocks.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The model is trained end-to-end with a cross-entropy loss.&lt;/span&gt;&lt;/p&gt;
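The fusion idea can be sketched as a single cross-attention step between the two streams. This is my simplification, not the paper's MixBlock: feature maps are flattened to (positions x channels) lists, one stream attends over the other, and the result is added back as a residual enhancement.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax with the usual max subtraction for stability."""
    out = []
    for row in m:
        mx = max(row)
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def cross_attention_fuse(f_fad, f_lfs):
    lfs_t = [list(c) for c in zip(*f_lfs)]      # (channels x positions)
    attn = softmax_rows(matmul(f_fad, lfs_t))   # FAD positions attend over LFS positions
    attended = matmul(attn, f_lfs)              # gather LFS features per FAD position
    return [[a + b for a, b in zip(r1, r2)]     # residual enhancement of the FAD stream
            for r1, r2 in zip(f_fad, attended)]

fused = cross_attention_fuse([[1.0, 0.0], [0.0, 1.0]],
                             [[0.5, 0.5], [0.2, 0.8]])
```

Running the same step in the other direction (LFS attending over FAD) and repeating it at the chosen backbone blocks would give the two-way message passing described above.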
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9; font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;b&gt;3. Experiments&amp;nbsp;&amp;nbsp;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3.1 Comparison with Other Models&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;First, the experimental results on the FaceForensics++ dataset.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1199&quot; data-origin-height=&quot;599&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bII7Bi/btszwcE1t7y/g66UDznpt8Jqu20famvTq1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bII7Bi/btszwcE1t7y/g66UDznpt8Jqu20famvTq1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bII7Bi/btszwcE1t7y/g66UDznpt8Jqu20famvTq1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbII7Bi%2FbtszwcE1t7y%2Fg66UDznpt8Jqu20famvTq1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;642&quot; height=&quot;321&quot; data-origin-width=&quot;1199&quot; data-origin-height=&quot;599&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3.2 Ablation study&amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Performance changes depending on whether FAD, LFS, and MixBlock are used.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1091&quot; data-origin-height=&quot;394&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Kip0d/btszvGTNwIZ/aip0xl7V4NpkZNCwgHjuu1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Kip0d/btszvGTNwIZ/aip0xl7V4NpkZNCwgHjuu1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Kip0d/btszvGTNwIZ/aip0xl7V4NpkZNCwgHjuu1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FKip0d%2FbtszvGTNwIZ%2Faip0xl7V4NpkZNCwgHjuu1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;658&quot; height=&quot;238&quot; data-origin-width=&quot;1091&quot; data-origin-height=&quot;394&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Next, performance changes depending on the filters used in FAD.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;1116&quot; data-origin-height=&quot;156&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bVLqmI/btszsA7TiGc/dKGqEtHUV8zStyggcWUs00/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bVLqmI/btszsA7TiGc/dKGqEtHUV8zStyggcWUs00/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bVLqmI/btszsA7TiGc/dKGqEtHUV8zStyggcWUs00/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbVLqmI%2FbtszsA7TiGc%2FdKGqEtHUV8zStyggcWUs00%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;1116&quot; height=&quot;156&quot; data-origin-width=&quot;1116&quot; data-origin-height=&quot;156&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper Review/Fake Detection</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/152</guid>
      <comments>https://ga02-ailab.tistory.com/152#entry152comment</comments>
      <pubDate>Tue, 31 Oct 2023 17:11:40 +0900</pubDate>
    </item>
    <item>
      <title>[8] MOBILEVIT: LIGHT-WEIGHT, GENERAL-PURPOSE,AND MOBILE-FRIENDLY VISION TRANSFORMER</title>
      <link>https://ga02-ailab.tistory.com/151</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Paper] &lt;a href=&quot;https://arxiv.org/pdf/2110.02178.pdf&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://arxiv.org/pdf/2110.02178.pdf&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;[Github] &lt;a href=&quot;https://github.com/apple/ml-cvnets&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://github.com/apple/ml-cvnets&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1696945309644&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;object&quot; data-og-title=&quot;GitHub - apple/ml-cvnets: CVNets: A library for training computer vision networks&quot; data-og-description=&quot;CVNets: A library for training computer vision networks - GitHub - apple/ml-cvnets: CVNets: A library for training computer vision networks&quot; data-og-host=&quot;github.com&quot; data-og-source-url=&quot;https://github.com/apple/ml-cvnets&quot; data-og-url=&quot;https://github.com/apple/ml-cvnets&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/6IyY3/hyT9LOoebz/Vf6Lgym4wtSNsqp29tiWJ0/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600&quot;&gt;&lt;a href=&quot;https://github.com/apple/ml-cvnets&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://github.com/apple/ml-cvnets&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/6IyY3/hyT9LOoebz/Vf6Lgym4wtSNsqp29tiWJ0/img.png?width=1200&amp;amp;height=600&amp;amp;face=0_0_1200_600');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;GitHub - apple/ml-cvnets: CVNets: A library for training computer vision networks&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;CVNets: A library for training computer vision networks - GitHub - apple/ml-cvnets: CVNets: A library for training computer vision networks&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;github.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In this post, I will review the MobileViT paper :)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Abstract&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;One of the most widely used networks in vision is the CNN. Because CNNs have a spatial inductive bias, they can learn various vision tasks with fewer parameters. One drawback of CNNs, however, is that their features are local. To learn global information, the self-attention-based Vision Transformer (ViT) is widely used, but compared to a CNN, a ViT has far more parameters. This paper sets out to address both issues.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;MobileViT is a network designed for mobile vision tasks: it is light-weight, has low latency, and combines the strengths of both CNNs and ViTs.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;MobileViT takes a different view from standard Transformers on how global information should be processed. The paper demonstrates through a variety of experiments that, as a result, MobileViT performs considerably better than both CNNs and ViTs.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;color: #000000; background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;1. Introduction&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;MobileViT makes ViT &quot;light and fast&quot; so that vision tasks can run on resource-constrained mobile devices. Transformer-family networks have typically improved performance by increasing the number of parameters, but this approach greatly increases model size and latency. Since CNNs are still used extensively in vision, their strengths cannot be given up either, so MobileViT is designed to combine the advantages of CNNs and ViTs and thereby overcome these drawbacks.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- CNN: has a spatial inductive bias and is less sensitive to data augmentation.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Transformer: heavy and hard to optimize. Data augmentation is essential to make up for the lack of inductive bias and to prevent over-fitting, and a heavy decoder is required for downstream tasks. On the other hand, it can process global information.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;MobileViT takes only the advantages: it can encode both local and global information while remaining a light-weight, low-latency model, and it also achieves higher accuracy than CNNs.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;So how does it combine the CNN and the Transformer? Let us take a brief look.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;A standard convolution goes through unfolding -&amp;gt; local processing -&amp;gt; folding, but MobileViT replaces the local processing step with global processing using a transformer. This gives the block the characteristics of both a CNN and a ViT, so it achieves high performance with few parameters and a simple training recipe.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2. MobileViT: A Light-Weight Transformer&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;Before describing the MobileViT architecture, the paper briefly reviews ViT. The post below is a good introduction to ViT :)&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;&lt;span style=&quot;color: #000000;&quot;&gt;ViT:&lt;/span&gt; &lt;a href=&quot;https://ga02-ailab.tistory.com/147&quot;&gt;https://ga02-ailab.tistory.com/147&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.1 MobileViT Architecture&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;795&quot; data-origin-height=&quot;300&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bvU2ko/btsydVyXpN0/sNVzTCWA5g9h0P7dPWHAW0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bvU2ko/btsydVyXpN0/sNVzTCWA5g9h0P7dPWHAW0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bvU2ko/btsydVyXpN0/sNVzTCWA5g9h0P7dPWHAW0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbvU2ko%2FbtsydVyXpN0%2FsNVzTCWA5g9h0P7dPWHAW0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;795&quot; height=&quot;300&quot; data-origin-width=&quot;795&quot; data-origin-height=&quot;300&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.1.1 MobileViT block&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;The structure of the MobileViT block is shown in the figure above. Its role is to learn both local and global information with few parameters.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;First, the input tensor is passed through an n*n convolution and a point-wise (1*1) convolution to learn local spatial information; call the resulting tensor X_L. The model should also learn long-range non-local dependencies, meaning that image patches far apart from each other can still exchange information. The most common way to achieve this is dilated convolution, but it has the drawback that the dilation rate must be chosen carefully. Another way is self-attention: the effectiveness of multi-head self-attention has been demonstrated many times, but as repeatedly mentioned, it suffers from heavy weights and sub-standard optimizability, and this is exactly where ViT falls short on spatial inductive bias.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;To let MobileViT learn global representations together with a spatial inductive bias, X_L is unfolded into X_U, a set of non-overlapping flattened patches that keep the same d channels as X_L. X_U is expressed as follows.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;144&quot; data-origin-height=&quot;25&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/E4RY1/btsynDDEYv3/k1QkIAbUJTlMQsriXtben1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/E4RY1/btsynDDEYv3/k1QkIAbUJTlMQsriXtben1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/E4RY1/btsynDDEYv3/k1QkIAbUJTlMQsriXtben1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FE4RY1%2FbtsynDDEYv3%2Fk1QkIAbUJTlMQsriXtben1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;144&quot; height=&quot;25&quot; data-origin-width=&quot;144&quot; data-origin-height=&quot;25&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;N=HW/P (number of patches)&lt;/span&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;P=wh (dimension of each patch)&lt;/span&gt;&lt;/i&gt;&lt;/p&gt;
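The unfold and fold operations can be made concrete on a single-channel map. `unfold` and `fold` below are hypothetical helpers written for this post, not the released code: they split an H x W grid into N = HW/P non-overlapping flattened patches of P = w*h pixels each, and put them back.

```python
def unfold(x, ph, pw):
    """Split an H x W grid into non-overlapping flattened ph x pw patches."""
    H, W = len(x), len(x[0])
    patches = []
    for top in range(0, H, ph):
        for left in range(0, W, pw):
            patch = []
            for i in range(ph):
                for j in range(pw):
                    patch.append(x[top + i][left + j])
            patches.append(patch)       # one flattened patch of P = ph*pw pixels
    return patches                      # N = HW/P patches

def fold(patches, H, W, ph, pw):
    """Inverse of unfold: rebuild the H x W grid from the patch list."""
    x = [[None] * W for _ in range(H)]
    idx = 0
    for top in range(0, H, ph):
        for left in range(0, W, pw):
            patch = patches[idx]
            idx += 1
            for i in range(ph):
                for j in range(pw):
                    x[top + i][left + j] = patch[i * pw + j]
    return x

x = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = unfold(x, 2, 2)               # N = 4 patches, each with P = 4 pixels
```

In the real block this runs per channel of X_L, so X_U has shape P x N x d; because the patches are non-overlapping, folding X_G back loses nothing.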
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The relationships between patches are encoded by passing them through several transformer layers; this is how global information is learned. The encoded output is denoted X_G, and the process is expressed as follows.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;374&quot; data-origin-height=&quot;30&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cko3dZ/btsymDw2sOT/cMtuRkLIN6Jc3IFtrZUBkk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cko3dZ/btsymDw2sOT/cMtuRkLIN6Jc3IFtrZUBkk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cko3dZ/btsymDw2sOT/cMtuRkLIN6Jc3IFtrZUBkk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fcko3dZ%2FbtsymDw2sOT%2FcMtuRkLIN6Jc3IFtrZUBkk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;374&quot; height=&quot;30&quot; data-origin-width=&quot;374&quot; data-origin-height=&quot;30&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;X_U consists of patches carrying n*n local information. The final output X_G encodes global representations across different patches, so each pixel of X_G can be said to encode information from every pixel. Let us look at the figure.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;148&quot; data-origin-height=&quot;148&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/U1Rdk/btsydUUhBAa/EMkwkgIxI2edGovTAhoXK1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/U1Rdk/btsydUUhBAa/EMkwkgIxI2edGovTAhoXK1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/U1Rdk/btsydUUhBAa/EMkwkgIxI2edGovTAhoXK1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FU1Rdk%2FbtsydUUhBAa%2FEMkwkgIxI2edGovTAhoXK1%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;148&quot; height=&quot;148&quot; data-origin-width=&quot;148&quot; data-origin-height=&quot;148&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- The red pixel shares information with the blue pixels.&lt;/span&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- Through the n*n convolution, the blue pixels share information with their neighboring pixels.&lt;/span&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: center;&quot; data-ke-size=&quot;size16&quot;&gt;&lt;i&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;=&amp;gt; As a result, the red pixel can share information with every pixel of the input image.&lt;/span&gt;&lt;/i&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Thanks to this process, MobileViT can learn without losing either the order of the patches or the spatial information of the pixels. The final output X_G is then folded back to obtain X_F. A point-wise convolution projects X_F, which now carries both local and global information, back to the original channel dimension; the result is concatenated with X, the input of the MobileViT block, and the features are fused with an n*n convolution.&lt;/span&gt;&lt;/p&gt;
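&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The unfold / fold reshaping described above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: it only shows that splitting a feature map into P-pixel, N-patch form and folding it back preserves patch order and pixel positions exactly (the transformer that would mix information across patches is left out).&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def unfold(x, p):
    """Split an (H, W, d) feature map into N = HW/p^2 non-overlapping
    p*p patches, giving shape (P, N, d) with P = p*p pixels per patch."""
    H, W, d = x.shape
    x = x.reshape(H // p, p, W // p, p, d)        # (H/p, p, W/p, p, d)
    x = x.transpose(1, 3, 0, 2, 4)                # (p, p, H/p, W/p, d)
    return x.reshape(p * p, (H // p) * (W // p), d)

def fold(xg, p, H, W):
    """Inverse of unfold: (P, N, d) -> (H, W, d)."""
    d = xg.shape[-1]
    x = xg.reshape(p, p, H // p, W // p, d)
    x = x.transpose(2, 0, 3, 1, 4)                # back to (H/p, p, W/p, p, d)
    return x.reshape(H, W, d)

x = np.arange(4 * 4 * 1, dtype=float).reshape(4, 4, 1)
xu = unfold(x, 2)        # X_U: shape (P=4, N=4, d=1)
# (a transformer would mix information across the N patches here -> X_G)
xf = fold(xu, 2, 4, 4)   # X_F: identical to x when nothing is applied
print(np.array_equal(x, xf))  # True
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Because fold is the exact inverse of unfold, no spatial information is lost in the round trip.&lt;/span&gt;&lt;/p&gt;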
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.1.2 Relationship to convolution&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A standard convolution can be viewed as three steps: unfolding, a matrix multiplication that builds local representations, and folding. MobileViT replaces this local processing step with transformer layers to enable deeper, global processing. As a result, MobileViT retains the spatial bias that is a property of convolution, so the MobileViT block can be described as performing convolutions as transformers. An advantage of this structure is that MobileViT can run on a wide range of devices, from desktops to mobile phones.&lt;/span&gt;&lt;/p&gt;
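&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The three steps of a standard convolution can be sketched as the classic im2col trick. This is an illustrative single-channel, valid-padding sketch (not the paper's code): unfold each k*k neighborhood into a row, do one local matrix multiplication, and fold the result back into a map.&lt;/span&gt;&lt;/p&gt;

```python
import numpy as np

def conv2d_im2col(x, w):
    """Standard convolution as unfold -> matrix multiply -> fold.
    x: (H, W) input, w: (k, k) kernel -> valid convolution via matmul."""
    H, W = x.shape
    k = w.shape[0]
    # unfold: gather every k*k neighborhood into one row
    cols = np.array([
        x[i:i + k, j:j + k].ravel()
        for i in range(H - k + 1)
        for j in range(W - k + 1)
    ])                                        # shape (N, k*k)
    out = cols @ w.ravel()                    # local matrix multiplication
    return out.reshape(H - k + 1, W - k + 1)  # fold back to a feature map

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3)) / 9.0                     # 3x3 mean filter
print(conv2d_im2col(x, w))                    # [[ 5.  6.] [ 9. 10.]]
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;MobileViT keeps the unfold and fold steps but swaps the matrix multiplication in the middle for transformer layers.&lt;/span&gt;&lt;/p&gt;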
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; text-align: start;&quot;&gt;2.1.3 Light-weight&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Previous networks that combine convolutions and transformers were all heavy, with large parameter counts. How did MobileViT achieve a light-weight design?&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Let's first look at the earlier networks that combine convolutions and transformers. As the figure below shows, these networks convert spatial information into a latent representation.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;552&quot; data-origin-height=&quot;145&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/4xEKw/btsyrAhB1JY/FkF2JXxt3Opt0hhbwUyCkK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/4xEKw/btsyrAhB1JY/FkF2JXxt3Opt0hhbwUyCkK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/4xEKw/btsyrAhB1JY/FkF2JXxt3Opt0hhbwUyCkK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F4xEKw%2FbtsyrAhB1JY%2FFkF2JXxt3Opt0hhbwUyCkK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;552&quot; height=&quot;145&quot; data-origin-width=&quot;552&quot; data-origin-height=&quot;145&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In the transformer stage, adjacent pixels are stacked and then linearly projected into a latent space; it is this embedding operation that loses the image-specific inductive bias.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;MobileViT, in contrast, preserves the strengths of both convolutions and transformers while learning global representations, which is what makes the light-weight design possible.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;text-align: start;&quot;&gt;&lt;b&gt;2.1.4 Computational cost&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Comparing the computational complexity of MobileViT and ViT gives O(N^2 P d) and O(N^2 d), respectively. At first glance MobileViT looks more expensive, but in practice it uses about half the FLOPs of the DeiT model while achieving higher accuracy on the ImageNet-1k dataset. Again, this is possible because it combines the strengths of convolutions and transformers, as described in the previous section.&lt;/span&gt;&lt;/p&gt;
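&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The two O(.) terms can be compared with quick arithmetic. The feature-map size, patch size, and embedding dimension below are assumed example values, not numbers from the paper; the point is only that the theoretical attention cost differs by exactly the factor P.&lt;/span&gt;&lt;/p&gt;

```python
# Rough attention-cost comparison from the O(.) terms in the text,
# with assumed example values: a 32x32 feature map, 2x2 patches, d = 96.
H = W = 32
p = 2
d = 96
P = p * p               # pixels per patch
N = (H * W) // P        # number of patches

mobilevit_cost = N**2 * P * d   # O(N^2 * P * d)
vit_cost = N**2 * d             # O(N^2 * d)
print(mobilevit_cost / vit_cost)  # -> 4.0, i.e. exactly P times more
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The theoretical gap is only a constant factor of P, which is why the shallower, narrower MobileViT network can still end up cheaper in total FLOPs.&lt;/span&gt;&lt;/p&gt;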
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.1.5 MobileViT architecture&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The paper designs MobileViT in three variants. By model size, these are XXS, XS, and S.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #ffc1c8;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;2.2 Multi-Scale Sampler for Training Efficiency&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The multi-scale representation strategy used by previous ViT-family models was fine-tuning: the network is fine-tuned separately for each scale. With this approach, the positional encodings must be interpolated for each scale, and the model's performance inevitably depends on how that interpolation is done. In other words, the model needs to be trained on as many scales as possible to perform well. MobileViT, however, behaves much like a CNN, so it needs no positional embedding, and in turn no such fine-tuning.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Still, since multi-scale training is clearly effective at improving many CNNs, it cannot be discarded entirely, so MobileViT also adopts a modified multi-scale training method.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The methods used with conventional CNNs select one of a set of predefined input sizes at each iteration; because the batch size is fixed for the largest input size, GPU utilization drops at the smaller sizes. MobileViT therefore uses the following strategy.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;First, a sorted set of spatial resolutions S is given, as shown below.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;280&quot; data-origin-height=&quot;28&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bANm8h/btsyqxSs0VP/QcsPRWgCr9MRdC94aSK15K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bANm8h/btsyqxSs0VP/QcsPRWgCr9MRdC94aSK15K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bANm8h/btsyqxSs0VP/QcsPRWgCr9MRdC94aSK15K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbANm8h%2FbtsyqxSs0VP%2FQcsPRWgCr9MRdC94aSK15K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;280&quot; height=&quot;28&quot; data-origin-width=&quot;280&quot; data-origin-height=&quot;28&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Then, at the t-th training iteration, each GPU randomly selects one scale and sets the batch size according to the equation below.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;113&quot; data-origin-height=&quot;33&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ZCADS/btsys6mu2BW/L1FtPSw7Bf8ky8LON2WIBK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ZCADS/btsys6mu2BW/L1FtPSw7Bf8ky8LON2WIBK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ZCADS/btsys6mu2BW/L1FtPSw7Bf8ky8LON2WIBK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FZCADS%2Fbtsys6mu2BW%2FL1FtPSw7Bf8ky8LON2WIBK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;113&quot; height=&quot;33&quot; data-origin-width=&quot;113&quot; data-origin-height=&quot;33&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;As a result, the batch becomes larger at smaller scales, allowing faster training. The figure below compares the standard approach with the multi-scale sampler; as (b) shows, the multi-scale sampler performs fewer model updates, so the total epoch time also decreases.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;791&quot; data-origin-height=&quot;165&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/CdfSV/btsytLvluH3/3aEUK5Utk9ybFKb57KXBdK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/CdfSV/btsytLvluH3/3aEUK5Utk9ybFKb57KXBdK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/CdfSV/btsytLvluH3/3aEUK5Utk9ybFKb57KXBdK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FCdfSV%2FbtsytLvluH3%2F3aEUK5Utk9ybFKb57KXBdK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;791&quot; height=&quot;165&quot; data-origin-width=&quot;791&quot; data-origin-height=&quot;165&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The paper uses {(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)} as the multi-scale sampler resolutions.&lt;/span&gt;&lt;/p&gt;
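&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The sampler's batch-size rule from the equation above can be sketched as follows. Here b_t = (h_n * w_n * b) / (h_t * w_t), where (h_n, w_n) is the largest resolution and b its batch size; b = 32 is an assumed example value, not from the paper.&lt;/span&gt;&lt;/p&gt;

```python
import random

# Multi-scale sampler sketch: batch size scales inversely with the
# resolution area, so smaller inputs get proportionally larger batches.
scales = [(160, 160), (192, 192), (256, 256), (288, 288), (320, 320)]
h_n, w_n = scales[-1]   # largest resolution in the sorted set
b = 32                  # assumed batch size at the largest resolution

def batch_size(h_t, w_t):
    return (h_n * w_n * b) // (h_t * w_t)

for h_t, w_t in scales:
    print((h_t, w_t), batch_size(h_t, w_t))  # 128, 88, 50, 39, 32

# One training iteration: each GPU picks a scale at random.
h_t, w_t = random.choice(scales)
b_t = batch_size(h_t, w_t)
```

&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;At (160, 160) the batch is four times larger than at (320, 320), which is what keeps the GPUs busy at small scales.&lt;/span&gt;&lt;/p&gt;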
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;3. Experimental Results&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3.1 Comparison with CNN models&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;806&quot; data-origin-height=&quot;311&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ccqh8G/btsywxiOUw5/sHgStYCwxY2K4zk5VXPR8K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ccqh8G/btsywxiOUw5/sHgStYCwxY2K4zk5VXPR8K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ccqh8G/btsywxiOUw5/sHgStYCwxY2K4zk5VXPR8K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fccqh8G%2FbtsywxiOUw5%2FsHgStYCwxY2K4zk5VXPR8K%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;565&quot; height=&quot;218&quot; data-origin-width=&quot;806&quot; data-origin-height=&quot;311&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;MobileViT outperforms the other light CNNs. In particular, (b) shows that it achieves the highest accuracy despite having the fewest parameters. Moreover, even S, the largest of XXS, XS, and S, has fewer parameters and higher accuracy than the other CNNs.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3.2 Comparison with ViT models&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;790&quot; data-origin-height=&quot;370&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b2fLg9/btsysouapU6/0NTMwBR1KpqvlIIpvG6OEk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b2fLg9/btsysouapU6/0NTMwBR1KpqvlIIpvG6OEk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b2fLg9/btsysouapU6/0NTMwBR1KpqvlIIpvG6OEk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb2fLg9%2FbtsysouapU6%2F0NTMwBR1KpqvlIIpvG6OEk%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;595&quot; height=&quot;279&quot; data-origin-width=&quot;790&quot; data-origin-height=&quot;370&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Once again, MobileViT shows the best performance.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; text-align: start;&quot;&gt;3.3 Inference Speed&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;742&quot; data-origin-height=&quot;126&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/HL0kX/btsywuM9Lb5/KAVh4LGGyaWFlg4T5FUjE0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/HL0kX/btsywuM9Lb5/KAVh4LGGyaWFlg4T5FUjE0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/HL0kX/btsywuM9Lb5/KAVh4LGGyaWFlg4T5FUjE0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FHL0kX%2FbtsywuM9Lb5%2FKAVh4LGGyaWFlg4T5FUjE0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;742&quot; height=&quot;126&quot; data-origin-width=&quot;742&quot; data-origin-height=&quot;126&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;These are results measured on an iPhone 12. MobileNet was the fastest overall, and among the transformer-family models MobileViT was the fastest. On the GPU, however, MobileViT was the slowest; the authors attribute this to its larger input size (256) compared to the other models and to its shallow-and-narrow structure. If MobileViT's operations were implemented in a hardware-optimized way, we can expect its speed to improve further.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Next is a comparison among the MobileViT models themselves.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3.4 Performance by patch size&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;811&quot; data-origin-height=&quot;181&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/clMIL5/btsyryKSPdb/Kfqa33Zqds7RSBWweHr9qK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/clMIL5/btsyryKSPdb/Kfqa33Zqds7RSBWweHr9qK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/clMIL5/btsyryKSPdb/Kfqa33Zqds7RSBWweHr9qK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FclMIL5%2FbtsyryKSPdb%2FKfqa33Zqds7RSBWweHr9qK%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;735&quot; height=&quot;164&quot; data-origin-width=&quot;811&quot; data-origin-height=&quot;181&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;The figure above shows inference speed and accuracy for different patch sizes. Even for models of the same size, performance varies considerably with the patch size, so setting this value carefully is important.&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;3.5 CNN kernel size and patch size&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;In the figure below, n is the kernel size, and h and w are the patch dimensions.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-ke-mobileStyle=&quot;widthOrigin&quot; data-origin-width=&quot;785&quot; data-origin-height=&quot;236&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/svk2L/btsywuGoMHq/6lYUhZI5GcHlQJFQvGCQe0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/svk2L/btsywuGoMHq/6lYUhZI5GcHlQJFQvGCQe0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/svk2L/btsywuGoMHq/6lYUhZI5GcHlQJFQvGCQe0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fsvk2L%2FbtsywuGoMHq%2F6lYUhZI5GcHlQJFQvGCQe0%2Fimg.png&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot; loading=&quot;lazy&quot; width=&quot;622&quot; height=&quot;187&quot; data-origin-width=&quot;785&quot; data-origin-height=&quot;236&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;If h and w are larger than n, each pixel in a patch cannot receive information from every other pixel in that patch through the convolution, so the ability to represent local information degrades. This inevitably leads to a drop in overall accuracy.&lt;/span&gt;&lt;/p&gt;
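As a rough sketch (my own NumPy illustration, not the MobileViT reference code), the unfolding step that splits a feature map into non-overlapping h x w patches before the transformer block looks like this; with a 3x3 convolution in front (n = 3), a 2x2 patch keeps h and w no larger than n, so every pixel of a patch lies inside its neighbors' receptive fields:

```python
import numpy as np

# Hypothetical sketch of MobileViT-style patch unfolding:
# (H, W, C) feature map -> (num_patches, h * w, C) patch tensor.
def unfold_into_patches(x: np.ndarray, h: int, w: int) -> np.ndarray:
    H, W, C = x.shape
    assert H % h == 0 and W % w == 0, "patch size must divide the feature map"
    x = x.reshape(H // h, h, W // w, w, C)   # split both spatial axes
    x = x.transpose(0, 2, 1, 3, 4)           # bring the patch grid to the front
    return x.reshape(-1, h * w, C)           # flatten each patch

feat = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
patches = unfold_into_patches(feat, h=2, w=2)
print(patches.shape)  # (16, 4, 3): 16 patches of 2x2 pixels, 3 channels
```

Self-attention then runs across the 16 patch positions, while the preceding convolution supplies the within-patch local detail.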
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Paper Review/etc</category>
      <category>transformer</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/151</guid>
      <comments>https://ga02-ailab.tistory.com/151#entry151comment</comments>
      <pubDate>Sun, 15 Oct 2023 21:05:31 +0900</pubDate>
    </item>
    <item>
      <title>[Deep Learning Basics] Inductive Bias</title>
      <link>https://ga02-ailab.tistory.com/150</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;요즘 논문을 읽다보면 Inductive Bias라는 단어를 자주 볼 수 있는데요! 이번 포스팅은 Inductive Bias에 대해 작성해보겠습니다 :D&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;1. What is Inductive Bias?&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;When we train a deep learning model, we hope it learns to generalize well. Generalizing well means the model also classifies data it never saw during training appropriately. This is where Inductive Bias comes in.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;color: #000000; font-family: 'Noto Sans Demilight', 'Noto Sans KR';&quot;&gt;Inductive Bias is the set of additional assumptions a model uses to raise its ability to generalize, that is, to predict outputs for inputs it was never given. For a model to produce correct outputs even on data it has never seen, such extra assumptions are essential, and they come from prior knowledge. This is why well-generalized models carry a particular kind of Inductive Bias. In short, Inductive Bias is &quot;&lt;b&gt;the set of assumptions an algorithm holds that makes inductive inference on unseen data possible.&lt;/b&gt;&quot;&amp;nbsp;&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;2. Inductive Biases of Various Neural Networks&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Let's look at the Inductive Biases of FCNs, CNNs, RNNs, GNNs, and Transformers.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- &lt;b&gt;FCN&lt;/b&gt;: the most generic form of neural network, in which every unit is connected to every other unit. Because every element of the input can influence every element of the output, its Inductive Bias is very weak.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- &lt;b&gt;CNN&lt;/b&gt;: the network used most often in vision! A fixed-size filter sweeps across the entire image, so the network only ever takes in filter-sized pieces of information, which gives it very strong locality. It can also find the same object without difficulty even when it appears at different positions in the image. Hence CNNs have the Inductive Biases of Locality &amp;amp; Translation Invariance.&lt;/span&gt;&lt;/p&gt;
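The translation property can be seen concretely in a toy 1D example (my own sketch, not from the post): shifting the input to a convolution shifts its output by the same amount.

```python
import numpy as np

# 'valid' cross-correlation, the operation CNN layers actually compute.
def conv1d_valid(signal: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    n = len(kernel)
    return np.array([signal[i:i + n] @ kernel
                     for i in range(len(signal) - n + 1)])

kernel = np.array([1.0, 0.0, -1.0])   # simple edge detector
x = np.zeros(12); x[3] = 1.0          # impulse at position 3
x_shift = np.roll(x, 2)               # same impulse, shifted by 2

y = conv1d_valid(x, kernel)
y_shift = conv1d_valid(x_shift, kernel)
# Shifting the input by 2 shifts the overlapping part of the output by 2:
print(np.allclose(y[:-2], y_shift[2:]))  # True
```

Strictly speaking this is translation *equivariance*; pooling over the output is what turns it into invariance.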
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- &lt;b&gt;RNN&lt;/b&gt;: a network designed to process sequential data. Much like CNNs, it has the Inductive Biases of Sequential &amp;amp; Temporal Invariance.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- &lt;b&gt;GNN&lt;/b&gt;: GNNs are similar as well. Since features are exchanged only between connected nodes, they have the Inductive Bias of Permutation Invariance.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;- &lt;b&gt;Transformer&lt;/b&gt;: the &lt;a href=&quot;https://ga02-ailab.tistory.com/147&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;Vision Transformer (ViT) paper&lt;/a&gt; contains the following passage.&lt;/span&gt;&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style2&quot;&gt;Transformers lack some of the inductive biases inherent to CNNs, such as translation equivariance and locality, and therefore do not generalize well when trained on insufficient amounts of data.&lt;/blockquote&gt;
&lt;p&gt;&lt;span style=&quot;color: #000000;&quot;&gt;Unlike CNNs, Transformers rely on positional encoding and self-attention, so they inevitably carry less Inductive Bias than CNNs do. To make up for this, Transformers are pre-trained on large-scale datasets. In short: if you need global information, a Transformer is the better choice, but if local information dominates and you want to exploit Inductive Bias as much as possible, a CNN will serve you better.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
</description>
      <category>AI Research/Deep Learning</category>
      <category>Inductive bias</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/150</guid>
      <comments>https://ga02-ailab.tistory.com/150#entry150comment</comments>
      <pubDate>Sun, 1 Oct 2023 20:51:50 +0900</pubDate>
    </item>
    <item>
      <title>runtimeerror: found dtype long but expected float</title>
      <link>https://ga02-ailab.tistory.com/149</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;background-color: #c1bef9;&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;- 전체 에러 문구&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1695131334933&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;runtimeerror: found dtype long but expected float&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;=&amp;gt; This error occurs when tensor data types do not match.&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;b&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000; background-color: #c1bef9;&quot;&gt;- Solution&amp;nbsp;&amp;nbsp;&lt;/span&gt;&lt;/b&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;Convert the tensor's data type with the .to() method.&lt;/span&gt;&lt;/p&gt;
&lt;pre id=&quot;code_1695131411032&quot; class=&quot;bash&quot; data-ke-language=&quot;bash&quot; data-ke-type=&quot;codeblock&quot;&gt;&lt;code&gt;my_tensor.to(torch.float32)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;</description>
      <category>Error Note</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/149</guid>
      <comments>https://ga02-ailab.tistory.com/149#entry149comment</comments>
      <pubDate>Tue, 26 Sep 2023 22:40:19 +0900</pubDate>
    </item>
    <item>
      <title>Apple Machine Learning Research 사이트</title>
      <link>https://ga02-ailab.tistory.com/148</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;a href=&quot;https://machinelearning.apple.com/research/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot;&gt;https://machinelearning.apple.com/research/&lt;/a&gt;&lt;/p&gt;
&lt;figure id=&quot;og_1695130992368&quot; contenteditable=&quot;false&quot; data-ke-type=&quot;opengraph&quot; data-ke-align=&quot;alignCenter&quot; data-og-type=&quot;website&quot; data-og-title=&quot;Research&quot; data-og-description=&quot;Explore advancements in state of the art machine learning research in speech and natural language, privacy, computer vision, health, and more.&quot; data-og-host=&quot;machinelearning.apple.com&quot; data-og-source-url=&quot;https://machinelearning.apple.com/research/&quot; data-og-url=&quot;https://machinelearning.apple.com/research&quot; data-og-image=&quot;https://scrap.kakaocdn.net/dn/bbdqsG/hyTVZ04GEs/1Q4UMkkgYCKzwvyphWDpWk/img.png?width=1200&amp;amp;height=630&amp;amp;face=0_0_1200_630&quot;&gt;&lt;a href=&quot;https://machinelearning.apple.com/research/&quot; target=&quot;_blank&quot; rel=&quot;noopener&quot; data-source-url=&quot;https://machinelearning.apple.com/research/&quot;&gt;
&lt;div class=&quot;og-image&quot; style=&quot;background-image: url('https://scrap.kakaocdn.net/dn/bbdqsG/hyTVZ04GEs/1Q4UMkkgYCKzwvyphWDpWk/img.png?width=1200&amp;amp;height=630&amp;amp;face=0_0_1200_630');&quot;&gt;&amp;nbsp;&lt;/div&gt;
&lt;div class=&quot;og-text&quot;&gt;
&lt;p class=&quot;og-title&quot; data-ke-size=&quot;size16&quot;&gt;Research&lt;/p&gt;
&lt;p class=&quot;og-desc&quot; data-ke-size=&quot;size16&quot;&gt;Explore advancements in state of the art machine learning research in speech and natural language, privacy, computer vision, health, and more.&lt;/p&gt;
&lt;p class=&quot;og-host&quot; data-ke-size=&quot;size16&quot;&gt;machinelearning.apple.com&lt;/p&gt;
&lt;/div&gt;
&lt;/a&gt;&lt;/figure&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Sans Demilight', 'Noto Sans KR'; color: #000000;&quot;&gt;A site where you can browse the papers Apple has published, organized by field, conference, and year.&lt;/span&gt;&lt;/p&gt;
      <category>My Study/Project</category>
      <author>ga.0_0.ga</author>
      <guid isPermaLink="true">https://ga02-ailab.tistory.com/148</guid>
      <comments>https://ga02-ailab.tistory.com/148#entry148comment</comments>
      <pubDate>Tue, 19 Sep 2023 22:47:42 +0900</pubDate>
    </item>
  </channel>
</rss>