Görüntü sınıflandırmada yineleyen derin ağ ve görü dönüştürücü modellerinin karşılaştırılması = Comparison of recurrent deep network and vision transformer models in image classification

Bubo, Oğuzhan


dc.contributor.advisor	Doktor Öğretim Üyesi Burhan Baraklı
dc.date.accessioned	2024-01-26T12:23:09Z
dc.date.available	2024-01-26T12:23:09Z
dc.date.issued	2023
dc.identifier.citation	Bubo, Oğuzhan. (2023). Görüntü sınıflandırmada yineleyen derin ağ ve görü dönüştürücü modellerinin karşılaştırılması = Comparison of recurrent deep network and vision transformer models in image classification. (Yayınlanmamış Yüksek Lisans Tezi). Sakarya Üniversitesi Fen Bilimleri Enstitüsü
dc.identifier.uri	https://hdl.handle.net/20.500.12619/101792
dc.description	Made available as open-access full text pursuant to the “Law on Amendments to the Higher Education Law and to Certain Laws and Decree-Laws”, published in Official Gazette No. 30352 dated 06.03.2018, and the “Directive on the Electronic Collection, Organization, and Opening to Access of Graduate Theses” dated 18.06.2018.
dc.description.abstract	Görüntü sınıflandırma, görüntülerde veya videolarda bulunan belli nesneleri algılayıp, görüntü işleme alanında kullanan bir teknoloji aracıdır. Bu işlem, aynı zamanda bir dizi sınıflandırma sürecinin bir parçasıdır. Makineler aracılığıyla yapılan bu sınıflandırma işlemlerinde uzun zamandır çeşitli teoriler ve yöntemler ortaya atılmış ve uygulanmıştır. Geometrik, bölgesel tabanlı, mantıksal nitelikli, dikkat tabanlı çok sayıda modelleme çeşitli uygulamalarda sıklıkla kullanılmaktadır. İhtiyaç ve gereksinimlere göre çeşitli modeller ortaya atılmaya devam etmektedir. Son yıllarda, sinir ağları ile görüntü sınıflandırma çalışmaları gitgide artmaktadır. Bu çalışmada yineleyen sinir ağları, pekiştirmeli öğrenme, görü dönüştürücü kullanılarak görüntü sınıflandırma işleminin gerçekleştirilmesi hedeflenmektedir. Çalışmada dikkat tabanlı iki model olan, yineleyen dikkat modeli ile görü dönüştürücü aracılığıyla görüntü sınıflandırma işleminin gerçekleştirilmesi ve her iki dikkat modelinin karşılaştırılması ana hedeftir. Sinir ağı, sunulan nesneleri önceki öğrendikleriyle karşılaştırarak aradaki ilişkiyi tespit edip, nesneleri sınıflandırmaktadır. Yineleyen dikkat modelleri, son dönemlerde farklı tanıma ve sınıflandırma çalışmalarında üstün başarı göstermiştir. Çalışmada, yineleyen görsel dikkat modeli sınırlı bir sensör vasıtasıyla görsel bir ortamla ilişkili, hedef odaklı bir araçtır. Model, pekiştirmeli öğrenme ile görsel bir çevreyle ilişkiye giren hedefe yönelik bir ajanın birbirini izleyen karar verme prosesidir. Yineleyen dikkat modelinin çalışma prensibi, görüntünün sadece ilgilenilen bölümüne odaklanmaya dayanmaktadır. Bu sayede zamandan tasarruf sağlanarak, ilgilenilen piksel sayısında düşüş oluşur. Pekiştirmeli öğrenmedeki ajan, eylemleri yürütüp, ortamın reel gidişatını da etkiliyebilir. 
Çevre sınırlı gözlemlenebildiğinden, ajanın nasıl bir yol izleyeceğini ve sensörünü en etkin bir biçimde nasıl yerleştireceğini tespit etmek için bilgiyi zamanla entegre etmesi gerekmektedir. Her adımda, ajan bir ödül almaktadır. Ajanın ana amacıda, ödüllerin maksimuma çıkarılmasıdır. Ödüllerin maksimum olduğu haller için derin ağ modeli oluşturulmuş olmaktadır. Karşılaştırmada kullanılan bir diğer dikkat tabanlı model ise görü dönüştürücüdür. Görü dönüştürücü, yineleyen dikkat modeline göre daha yeni ortaya çıkmış bir modeldir. Dönüştürücü sinir ağları, ilk olarak NLP adı verilen dil işleme çalışmalarında sıklıkla kullanılmıştır. Bu tip çalışmalarda otomatik çeviri ile insan çevirisi arasındaki değişikliğin ölçüldüğü BLEU puanı adı verilen sistem kullanılmaktadır ve başarılı sonuçlar elde edilmektedir. Bu çalışmaların sonrasında, görü dönüştürücüler çeşitli sınıflandırma uygulamalarında da kullanılmaya başlamış ve harcanan zaman ve doğruluk açısından iyi sonuçlar elde edilmiştir. Görü dönüştürücü modelinde önce giriş görüntüntüleri sabit boyutlu parçalara ayrılarak, düzleştirilip bir vektör elde edilir. Bu vektörlerden doğrusal gömme dizileri oluşturulur. Oluşturulan bu gömme dizileri transformatör kodlayıcıya giriş olur. Modelde, çok başlı öz dikkat katmanı, çok katmanlı algılayıcı ve norm katmanları bulunmaktadır. Modelde, kişisel dikkat katmanı görüntü bilgilerinin gömülmesi ve yeniden yapılandırma görevlerine sahiptir. Yineleyen sinir ağları önemli iki probleme sahiptir. Bunlar; zaman olarak eğitimin uzunluğu ve kaybolan gradyan problemidir. Eğitim süresinin uzunluğu, güçlü GPU ve CPU'lar ile çözüme kavuşturulabilir. Ancak kaybolan gradyan problemi daha önemli bir sorundur. Geçmişten gelen bilgileri kullanmada uzun vadede yineleyen sinir ağlarında sıkıntılar oluşmaktadır. Bu kısmen uzun kısa süreli bellek aracılığıyla çözülmeye çalışılsada optimum düzeyde performans sağlayamamaktadır. 
Uzun kısa süreli bellek geçmişten gelen bilginin kullanımında yineleyen ağlara göre çok daha başarılı davranabilmektedir. Dönüştürücü sinir ağları kullanıldığında, bilgiler gömülerek belirlenebilir. Görü dönüştürücü dikkat tabanlı bir model olmakla beraber, bu gömülü bilgiler saklanarak, değişik ağırlık değerleri ile öncelik sırası belirlenerek kullanılmaktadır. Sınıflandırma uygulamalarında, sunulan nesnenin üstün bir başarıyla tespiti için kullanılan veri seti belli aşamalardan geçmelidir. Çalışma, PyTorch üzerinde gerçekleştirilmiştir. Bu kütüphanenin erişim sağlayabildiği datasetlerinden Fashion-MNIST ve CIFAR-10 datasetleri her iki dikkat tabanlı modeller üzerinde test edilerek, çıkan sonuçlar karşılaştırılmıştır. CIFAR-10 dataset çeşitli ulaşım araçları ve hayvanlar olmak üzere on farklı sınıftan oluşmaktadır. Görüntü boyutu 32x32'dir ve 60000 görüntüden oluşmaktadır. Fashion-MNIST dataseti ise on farklı giyim ürünü türünden oluşan ve 70000 görüntüye sahip bir datasettir. Deney kısmında, google colab'in kullanıcılarına sunduğu Tesla T4 GPU kullanılmıştır. İlk olarak Fashion-MNIST datasetinin yineleyen dikkat modeli üzerinde eğitimi gerçekleştirilmiştir. Modelin yüksek duyarlı olduğu patch boyutu, bakış sayısı, iterasyon sayısı, öğrenme oranı gibi parametreleri değiştirilerek tepkileri ölçülmüş ve birçok eğitim gerçekleştirilmiştir. Görüntü boyutları göz önüne alınarak patch boyutunun sekiz olması uygun görülmüştür. Bakış sayısı yükseltildiğinde nispeten ufak artışlar olmasına rağmen eğitim süresinin ciddi bir oranda arttığı gözlemlenmiştir. Bu nedenle altı olarak uygun görülmüştür. İterasyon sayısı 300 olarak alınsada yeterli gelişme görülmediğinden eğitim 256. iterasyonda sonlandırılmıştır. Öğrenme oranı zamanla azalan biçimdedir. Eğitim sonuç grafikleri incelendiğinde, eğitim ve test kayıp oranları azalırken; eğitim ve test doğruluk oranları ise artmaktadır. Daha sonrasında ise aynı işlemler CIFAR-10 dataseti üzerinde gerçekleştirilmiştir. 
Görüntü boyutuna göre patch boyutu on iki olarak alınmıştır. Bakış sayısı sekize yükseltilmiştir. Fashion-MNIST datasetindeki iterasyon ve öğrenme oranı ise sabit kalmıştır. Kayıp ve doğruluk oranları bir önceki datasetle benzer çizgidedir. Bir sonraki aşamada görü dönüştürücü modelinin Fashion-MNIST ve CIFAR-10 datasetleri üzerinde eğitimi gerçekleştirilmiştir. İlk olarak, Fashion-MNIST datasetinde, görü dönüştürücü modelin duyarlı olduğu patch boyutu, derinlik, MLP boyutu gibi parametreleri değiştirirek çeşitli eğitimler gerçekleştirilmiştir ve sonuçlar incelenmiştir. Seçilen parametre değerleri tablo halinde verilmiştir. İterasyon sayısında kırk esas alınmıştır, çünkü bu değer sonrasında kaybın yükseldiği gözlemlenmiştir. Test ve eğitim kayıp oranları azalırken, doğruluk oranları ise artmaktadır. Daha sonrasında benzer işlemler CIFAR-10 dataset üzerinde de gerçekleştirilmiştir. Çeşitli denemelerden sonra optimum parametre değerleri belirlenmiştir. Kayıp ve doğruluk oranları bir önceki datasetle benzer çizgidedir. Fashion-MNIST datasetinin yineleyen dikkat modeli üzerinde gerçekleştirilen eğitimi sonucu, eğitim ve test kayıp oranları azalırken; eğitim ve test doğruluk oranları ise artmıştır. Bu eğitim süreci 256. iterasyonda sonuçlanmıştır. Bunun akabinde, Fashion-MNIST datasetinin görü dönüştürücü modeli üzerinde eğitimi sonucunda, yine aynı şekilde eğitim ve test kayıp oranları azalırken; eğitim ve test doğruluk oranları ise artmıştır. Görü dönüştürücü modelinde kırk iterasyon gibi bir süreçte, yineleyen dikkat modeli ile gerçekleştirilen eğitim sonucunda ulaşılan değerlere yaklaşmıştır. Fashion-MNIST datasetinin yineleyen dikkat modeli ile görü dönüştürücü modeli arasında zamansal bir kıyaslama yapıldığında, görü dönüştürücü modelinin yineleyen dikkat modeline göre daha avantajlı olduğu görülmektedir. 
CIFAR-10 datasetinin yineleyen dikkat modeli üzerinde gerçekleştirilen eğitimi sonucu, eğitim ve test kayıp oranları azalırken; eğitim ve test doğruluk oranları ise artmıştır. Bu eğitim süreci 293. iterasyonda sonuçlanmıştır. Bunun akabinde, CIFAR-10 datasetinin görü dönüştürücü modeli üzerinde eğitimi sonucunda, yine aynı şekilde eğitim ve test kayıp oranları azalırken; eğitim ve test doğruluk oranları ise artmıştır. Görü dönüştürücü modelinde 200 iterasyon gibi bir süreçte, yineleyen dikkat modeli ile gerçekleştirilen eğitim sonucunda ulaşılan değerlerden çok daha iyi sonuçlar elde edilmiştir. CIFAR-10 datasetinin yineleyen dikkat modeli ile görü dönüştürücü modeli arasında zamansal bir kıyaslama yapıldığında, görü dönüştürücü modelinin yineleyen dikkat modeline göre daha avantajlı olduğu görülmektedir. Sonuç olarak deney bölümünde, her iki dataset eğitim sonuçlarına ait grafik ve tablolar sunulmuştur. Elde edilen sonuçlara göre görü dönüştürücü modeli, yineleyen dikkat modeline oranla eğitim süresi bakımından daha hızlı sonuçlanırken, elde edilen doğruluk ve kayıp oranlarında da daha iyi sonuçlar verdiği gözlemlenmiştir. Önceden eğitilmiş verisetleri kullanıldığında daha iyi sonuçlar elde edileceği öngörülmektedir.
dc.description.abstract	Image classification is a technique that detects particular objects in images or videos for use in image processing, and it forms one stage of a broader series of classification processes. Many theories and methods for machine-based classification have been proposed and applied over the years. Geometric, region-based, logic-based, and attention-based models are all widely used, and new models continue to be introduced as requirements change. In recent years, image classification with neural networks has grown steadily. This study aims to perform image classification using recurrent neural networks, reinforcement learning, and the vision transformer. Its main objective is to classify images with two attention-based models, the recurrent attention model and the vision transformer, and to compare them. A neural network classifies a presented image by comparing it with what it has previously learned and detecting the relationship between them. Recurrent attention models have performed well in a range of recent recognition and classification studies. In this study, the recurrent visual attention model is treated as a goal-directed agent that interacts with a visual environment through a limited sensor; trained with reinforcement learning, it carries out a sequential decision-making process. The model works by attending only to the part of the image that is of interest, which reduces the number of pixels processed and saves time. The agent in reinforcement learning can also execute actions that affect the true state of the environment. 
Because the environment is only partially observable, the agent must integrate information over time to decide how to act and how to deploy its sensor most effectively. At each step the agent receives a reward, and its main goal is to maximize the rewards it receives. The deep network model is obtained for the cases in which the reward is maximal. The other attention-based model in the comparison is the vision transformer, which is more recent than the recurrent attention model. Transformer neural networks were first widely used in natural language processing (NLP). In that field, the BLEU score, which measures the difference between machine translation and human translation, is used for evaluation, and successful results have been obtained. Vision transformers were subsequently applied to various classification tasks, with good results in both time and accuracy. In the vision transformer model, each input image is first split into fixed-size patches, and each patch is flattened into a vector. Linear embeddings are computed from these vectors, and the resulting embedding sequence is fed to the transformer encoder. The model contains multi-head self-attention layers, multi-layer perceptrons, and normalization layers. The self-attention layer is responsible for embedding and reconstructing image information. Recurrent neural networks have two major problems: long training times and the vanishing gradient problem. Training time can be reduced with powerful GPUs and CPUs, but the vanishing gradient problem is more serious: over long sequences, recurrent neural networks struggle to use information from the past. Long short-term memory (LSTM) makes much better use of past information than plain recurrent networks, but it solves the problem only partially and does not reach optimum performance. 
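The patch-splitting and linear-embedding steps described above for the vision transformer can be sketched as follows. This is a minimal, library-free illustration rather than the thesis code (which used PyTorch): the function names are hypothetical, the image is a dummy single-channel 32x32 array, and `patch_size = 8` and `embed_dim = 4` are assumed values chosen only to keep the example small.

```python
# Illustrative sketch only; names and parameter values are assumptions,
# not taken from the thesis.
import random


def split_into_patches(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block row by row.
            flat = [image[top + r][left + c]
                    for r in range(patch_size)
                    for c in range(patch_size)]
            patches.append(flat)
    return patches


def linear_embed(patches, weights, bias):
    """Project each flattened patch to len(weights) values: e = W x + b."""
    return [[sum(w * x for w, x in zip(row, patch)) + b
             for row, b in zip(weights, bias)]
            for patch in patches]


random.seed(0)
# Dummy 32x32 grayscale image (CIFAR-10 images are 32x32 but have three channels).
image = [[(r * 32 + c) / 1023.0 for c in range(32)] for r in range(32)]
patch_size, embed_dim = 8, 4  # assumed values
patches = split_into_patches(image, patch_size)
weights = [[random.uniform(-0.1, 0.1) for _ in range(patch_size * patch_size)]
           for _ in range(embed_dim)]
bias = [0.0] * embed_dim
tokens = linear_embed(patches, weights, bias)
print(len(patches), len(patches[0]), len(tokens), len(tokens[0]))  # 16 64 16 4
```

With these assumed values, a 32x32 image yields 16 patches of 64 pixels each, and the encoder would receive a sequence of 16 embedding vectors. In a trained model the projection weights are learned rather than random.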
With transformer neural networks, information is captured through embeddings. The vision transformer is an attention-based model: these embeddings are stored and used with different weights that determine their order of priority. In image classification applications, the dataset must pass through certain stages before the presented image can be recognized with high accuracy. The study was carried out with the PyTorch library. The Fashion-MNIST and CIFAR-10 datasets, both accessible through this library, were tested on both attention-based models and the results were compared. CIFAR-10 consists of 60000 images of size 32x32 in ten classes covering various vehicles and animals. Fashion-MNIST consists of 70000 images in ten classes of clothing items. The experiments used the Tesla T4 GPU that Google Colab offers to its users. First, the recurrent attention model was trained on Fashion-MNIST. Many training runs were carried out while varying the parameters to which the model is highly sensitive, such as patch size, number of glimpses, number of iterations, and learning rate, and the model's response was measured. Given the image dimensions, a patch size of eight was chosen. Increasing the number of glimpses produced relatively small gains but increased training time considerably, so six glimpses were used. Although the number of iterations was set to 300, training was stopped at iteration 256 because no sufficient further improvement was observed. The learning rate decays over time. The training curves show training and test loss decreasing while training and test accuracy increase. The same procedure was then applied to CIFAR-10. 
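To make the glimpse settings above concrete, the sketch below extracts six 8x8 glimpses from a dummy 28x28 image (28x28 is the standard Fashion-MNIST image size, which the abstract does not state) and counts the pixels read. It is a simplified illustration under stated assumptions: the helper name and the glimpse centers are hypothetical, the glimpse is a single-scale window with zero padding, and in the actual model the locations would be chosen by the learned policy rather than fixed in advance.

```python
# Illustrative sketch only; helper name and example locations are hypothetical.


def extract_glimpse(image, center_row, center_col, size):
    """Return a size x size window around the given center; pixels outside the image are 0."""
    height, width = len(image), len(image[0])
    top = center_row - size // 2
    left = center_col - size // 2
    return [[image[r][c] if 0 <= r < height and 0 <= c < width else 0
             for c in range(left, left + size)]
            for r in range(top, top + size)]


image = [[r * 28 + c for c in range(28)] for r in range(28)]  # dummy 28x28 image
glimpse_size, num_glimpses = 8, 6  # values reported in the abstract for Fashion-MNIST
# Arbitrary example centers; a trained agent would pick these itself.
locations = [(4, 4), (4, 14), (14, 14), (14, 24), (24, 14), (24, 24)]
glimpses = [extract_glimpse(image, r, c, glimpse_size) for r, c in locations]

# Upper bound on pixels read per image; overlapping glimpses would read fewer distinct pixels.
pixels_seen = num_glimpses * glimpse_size * glimpse_size
pixels_total = 28 * 28
print(pixels_seen, pixels_total)  # 384 784
```

Under these assumptions the agent reads at most 384 of the 784 pixels, which illustrates why attending only to regions of interest reduces the number of pixels processed.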
Given the image size, the patch size was set to twelve, and the number of glimpses was raised to eight. The number of iterations and the learning rate were kept the same as for Fashion-MNIST. The loss and accuracy curves follow a pattern similar to that of the previous dataset. Next, the vision transformer model was trained on Fashion-MNIST and CIFAR-10. For Fashion-MNIST, several training runs were carried out while varying the parameters to which the vision transformer is sensitive, such as patch size, depth, and MLP size, and the results were examined. The selected parameter values are given in a table. Forty iterations were used because the loss was observed to rise beyond this point. Training and test loss decrease while accuracy increases. Similar experiments were then carried out on CIFAR-10, and the optimum parameter values were determined after several trials. The loss and accuracy curves again follow a pattern similar to that of the previous dataset. When the recurrent attention model was trained on Fashion-MNIST, training and test loss decreased while training and test accuracy increased; this training ended at iteration 256. When the vision transformer was trained on Fashion-MNIST, training and test loss likewise decreased while training and test accuracy increased. In about forty iterations, the vision transformer approached the values reached by the recurrent attention model. Comparing the two models on Fashion-MNIST in terms of training time, the vision transformer has the advantage over the recurrent attention model. 
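The abstract reports that training was cut short once the loss stopped improving or began to rise, but it does not say how that point was chosen. One common way to automate such a decision is a patience-based early-stopping rule, sketched below. The rule, the function name, the `patience` value, and the loss values are all assumptions for illustration and are not taken from the thesis.

```python
# Illustrative sketch only; the patience rule and all values are assumptions.


def best_stopping_point(val_losses, patience):
    """Return (best_index, stop_index).

    best_index is the iteration with the lowest loss seen before stopping;
    stop_index is the iteration at which training would halt, i.e. the first
    one that is `patience` iterations past the best without improvement.
    """
    best_index, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_index, best_loss = i, loss
        elif i - best_index >= patience:
            return best_index, i
    return best_index, len(val_losses) - 1


losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56]  # made-up loss values
print(best_stopping_point(losses, patience=3))  # (3, 6)
```

Here the lowest loss occurs at index 3, and training would stop at index 6 after three iterations without improvement.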
When the recurrent attention model was trained on CIFAR-10, training and test loss decreased while training and test accuracy increased; this training ended at iteration 293. When the vision transformer was trained on CIFAR-10, training and test loss likewise decreased while training and test accuracy increased. In about 200 iterations, the vision transformer achieved much better results than the recurrent attention model. Comparing the two models on CIFAR-10 in terms of training time, the vision transformer again has the advantage over the recurrent attention model. Graphs and tables of the training results for both datasets are presented in the experiment section. According to the results, the vision transformer trains faster than the recurrent attention model and also gives better accuracy and loss values. Better results are expected when pre-trained models are used.
dc.format.extent	xxviii, 74 leaves : figures, tables ; 30 cm.
dc.language	Turkish
dc.language.iso	tur
dc.publisher	Sakarya Üniversitesi
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/
dc.rights.uri	info:eu-repo/semantics/openAccess
dc.subject	Elektrik ve Elektronik Mühendisliği
dc.subject	Electrical and Electronics Engineering
dc.title	Görüntü sınıflandırmada yineleyen derin ağ ve görü dönüştürücü modellerinin karşılaştırılması = Comparison of recurrent deep network and vision transformer models in image classification
dc.type	masterThesis
dc.contributor.department	Sakarya Üniversitesi, Fen Bilimleri Enstitüsü, Elektrik-Elektronik Mühendisliği Ana Bilim Dalı, Elektrik Mühendisliği Bilim Dalı
dc.contributor.author	Bubo, Oğuzhan
dc.relation.publicationcategory	TEZ